This content will become publicly available on January 1, 2027
Enhanced thermophysical property prediction with uncertainty quantification using group contribution-Gaussian process regression
Group contribution (GC) models are powerful, simple, and popular methods for property prediction. However, the most accessible and computationally efficient GC methods, like the Joback and Reid (JR) GC models, often exhibit severe systematic bias. Furthermore, most GC methods do not have uncertainty estimates associated with their predictions. The present work develops a hybrid method for property prediction that integrates GC models with Gaussian process (GP) regression. Predictions from the JR GC method, along with the molecular weight, are used as input features to the GP models, which learn and correct the systematic biases in the GC predictions, resulting in highly accurate property predictions with reliable uncertainty estimates. The method was applied to six properties: normal boiling temperature (Tb), enthalpy of vaporization at Tb (ΔHvap), normal melting temperature (Tm), critical pressure (Pc), critical molar volume (Vc), and critical temperature (Tc). The CRC Handbook of Chemistry and Physics was used as the primary source of experimental data. The final collected experimental data ranged from 485 molecules for ΔHvap to 5640 for Tm. The proposed GCGP method significantly improved property prediction accuracy compared to the GC-only method. The coefficient of determination (R²) values of the testing set predictions are ≥0.85 for five out of six and ≥0.90 for four out of six properties modeled, and compare favorably with other methods in the literature. Tm was used to demonstrate one way the GCGP method can be tuned for even better predictive accuracy. The GCGP method provides reliable uncertainty estimates and computational efficiency for making new predictions. The GCGP method proved robust to variations in GP model architecture and kernel choice.
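The idea of feeding a GC prediction and the molecular weight into a GP that corrects the GC bias can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the data are synthetic (a made-up "GC" estimate with a smooth, size-dependent bias stands in for real JR predictions and CRC Handbook values), the kernel is a plain squared-exponential with fixed rather than optimized hyperparameters, and the GP is fit to the residual between the observed value and the GC estimate, which is one of several possible ways to let a GP correct a baseline model. The class name `GCGP` and the helper `make_data` are hypothetical.

```python
import math
import random


def rbf(a, b, length=1.0):
    """Squared-exponential kernel with unit signal variance."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length ** 2)


def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def forward_sub(L, b):
    """Solve L y = b."""
    y = []
    for i, bi in enumerate(b):
        y.append((bi - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def backward_sub(L, y):
    """Solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def mean_std(values):
    m = sum(values) / len(values)
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


class GCGP:
    """Hypothetical GP correction of a GC estimate; inputs are (GC prediction, MW)."""

    def __init__(self, length=1.0, noise=0.05):
        self.length, self.noise = length, noise  # fixed here, not optimized

    def _scale(self, gc, mw):
        return ((gc - self.gc_m) / self.gc_s, (mw - self.mw_m) / self.mw_s)

    def fit(self, gc, mw, y):
        self.gc_m, self.gc_s = mean_std(gc)
        self.mw_m, self.mw_s = mean_std(mw)
        resid = [yi - gi for yi, gi in zip(y, gc)]  # what the GC estimate misses
        self.r_m, self.r_s = mean_std(resid)
        self.X = [self._scale(g, m) for g, m in zip(gc, mw)]
        n = len(self.X)
        K = [[rbf(self.X[i], self.X[j], self.length) + (self.noise if i == j else 0.0)
              for j in range(n)] for i in range(n)]
        self.L = cholesky(K)
        z = [(r - self.r_m) / self.r_s for r in resid]
        self.alpha = backward_sub(self.L, forward_sub(self.L, z))
        return self

    def predict(self, gc, mw):
        """Return (corrected prediction, predictive standard deviation)."""
        x = self._scale(gc, mw)
        k = [rbf(x, xi, self.length) for xi in self.X]
        mean = sum(ki * ai for ki, ai in zip(k, self.alpha))
        v = forward_sub(self.L, k)
        var = 1.0 + self.noise - sum(vi * vi for vi in v)
        return gc + self.r_m + self.r_s * mean, self.r_s * math.sqrt(max(var, 0.0))


def make_data(n, rng):
    """Synthetic molecules: a 'GC' estimate with a smooth, size-dependent bias."""
    gc, mw, y = [], [], []
    for _ in range(n):
        m = rng.uniform(50.0, 300.0)
        truth = 250.0 + 1.1 * m + rng.gauss(0.0, 3.0)
        bias = 0.0015 * (m - 100.0) ** 2 - 10.0
        gc.append(truth + bias + rng.gauss(0.0, 3.0))
        mw.append(m)
        y.append(truth)
    return gc, mw, y


rng = random.Random(0)
gc_tr, mw_tr, y_tr = make_data(40, rng)
gc_te, mw_te, y_te = make_data(20, rng)
model = GCGP().fit(gc_tr, mw_tr, y_tr)
preds = [model.predict(g, m) for g, m in zip(gc_te, mw_te)]
mae_gc = sum(abs(g - t) for g, t in zip(gc_te, y_te)) / len(y_te)
mae_gcgp = sum(abs(p - t) for (p, _), t in zip(preds, y_te)) / len(y_te)
print(f"MAE GC-only: {mae_gc:.1f}   MAE GC+GP: {mae_gcgp:.1f}")
```

Each call to `predict` returns both a corrected value and a standard deviation, which is the kind of per-prediction uncertainty the abstract refers to. In practice one would likely use an established GP library and optimize the kernel hyperparameters by maximizing the marginal likelihood instead of fixing them as done here.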
- Award ID(s): 2330175
- PAR ID: 10654702
- Publisher / Repository: Royal Society of Chemistry
- Date Published:
- Journal Name: Molecular Systems Design & Engineering
- ISSN: 2058-9689
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
