Search for: All records

Creators/Authors contains: "Chatterjee, Snigdhansu"


  1. We present a new semiparametric extension of the Fay-Herriot model, termed the agnostic Fay-Herriot model (AGFH), in which the sampling-level model is expressed in terms of an unknown general function. Because the choice of this function is extremely broad, the AGFH model can express any distribution in the sampling model. We propose a Bayesian modelling scheme for AGFH in which the unknown function is assigned a Gaussian process prior. Using a Metropolis-within-Gibbs Markov chain Monte Carlo scheme, we study the performance of the AGFH model alongside that of a hierarchical Bayesian extension of the Fay-Herriot model. Our analysis shows that AGFH is an excellent modelling alternative when the sampling distribution is non-Normal, especially when the sampling distribution is bounded, and it is also the best choice when the sampling variance is high. However, the hierarchical Bayesian framework and the traditional empirical Bayesian framework can be good modelling alternatives when the signal-to-noise ratio is high and there are computational constraints. (A toy Metropolis-within-Gibbs sampler for a basic Fay-Herriot model is sketched after this list.) AMS subject classification: 62D05; 62F15
  2. The melting temperature is important for materials design because of its relationship with thermal stability, synthesis, and processing conditions. Current empirical and computational melting-point estimation techniques are limited in scope, computational feasibility, or interpretability. We report the development of a machine learning methodology for predicting the melting temperatures of binary ionic solid materials. We evaluated different machine learning models trained on a dataset of the melting points of 476 non-metallic crystalline binary compounds, using materials embeddings constructed from elemental properties and density functional theory calculations as model inputs. A direct supervised-learning approach yields a mean absolute error of around 180 K but suffers from low interpretability. We find that the fidelity of the predictions can be further improved by introducing an additional unsupervised-learning step that first classifies the materials before the melting-point regression. Not only does this two-step model exhibit improved accuracy, but it also provides a level of interpretability, with insights into feature importance and into the different types of melting that depend on the specific atomic bonding inside a material. Motivated by this finding, we used a symbolic learning approach to find interpretable physical models for the melting temperature, which recovered the best-performing features from both prior models and provided additional interpretability. (A generic cluster-then-regress sketch follows this list.)
    Free, publicly-accessible full text available May 28, 2025
  3. We propose a resampling-based fast variable selection technique for detecting relevant single nucleotide polymorphisms (SNPs) in a multi-marker mixed effect model. Due to computational complexity, current practice primarily involves testing the effect of one SNP at a time, commonly termed 'single-SNP association analysis'. Joint modeling of genetic variants within a gene or pathway may have better power to detect associated genetic variants, especially those with weak effects. In this paper, we propose a computationally efficient model selection approach, based on the e-values framework, for single-SNP detection in families while utilizing information on multiple SNPs simultaneously. To overcome the computational bottleneck of traditional model selection methods, our method trains a single model and utilizes a fast and scalable bootstrap procedure. We illustrate through numerical studies that our proposed method is more effective in detecting SNPs associated with a trait than either single-marker analysis using family data or model selection methods that ignore the familial dependency structure. Further, we perform a gene-level analysis of the Minnesota Center for Twin and Family Research (MCTFR) dataset using our method and detect several SNPs that have previously been implicated as associated with alcohol consumption. (A generic bootstrap-screening sketch follows this list.)
  4. In hydrology, modeling streamflow remains a challenging task due to the limited availability of information on basin characteristics such as soil geology and geomorphology. These characteristics may be noisy due to measurement errors or may be missing altogether. To overcome this challenge, we propose a knowledge-guided, probabilistic inverse modeling method for recovering physical characteristics from streamflow and weather data, which are more readily available. We compare our framework with state-of-the-art inverse models for estimating river basin characteristics. We also show that these estimates improve streamflow modeling compared with using the original basin characteristic values. Our framework offers a 3% improvement in R2 for the inverse model (basin characteristic estimation) and 6% for the forward model (streamflow prediction). It also offers improved explainability, since it can quantify uncertainty in both the inverse and the forward model. Uncertainty quantification plays a pivotal role in improving the explainability of machine learning models by providing additional insight into the reliability and limitations of model predictions. In our analysis, we assess the quality of the uncertainty estimates. Compared to baseline uncertainty quantification methods, our framework offers a 10% improvement in the dispersion of epistemic uncertainty and a 13% improvement in coverage rate. This information can help stakeholders understand the level of uncertainty associated with the predictions and provides a more comprehensive view of the potential outcomes. (A small sketch of coverage-rate and dispersion metrics follows this list.)
  5. Shekhar, Shashi; Zhou, Zhi-Hua; Chiang, Yao-Yi; Stiglic, Gregor (Ed.)
    Rapid advances in inverse modeling methods have brought to light their susceptibility to imperfect data. This has made it imperative to obtain more explainable and trustworthy estimates from these models. In hydrology, basin characteristics can be noisy or missing, impacting streamflow prediction. We propose a probabilistic inverse model framework that can reconstruct robust hydrology basin characteristics from dynamic weather-driver inputs and streamflow response data. We address two aspects of building more explainable inverse models: uncertainty estimation (uncertainty due to imperfect data and an imperfect model) and robustness. This can help improve the trust of water managers, improve the handling of noisy data, and reduce costs. We also propose an uncertainty-based loss regularization that removes 17% of the temporal artifacts in the reconstructions, reduces uncertainty by 36%, and yields a 4% higher coverage rate for basin characteristics. The forward model performance (streamflow estimation) is also improved by 6% when using the reconstructions learned with this uncertainty-based loss. (A sketch of a common uncertainty-weighted loss follows this list.)
  6. Simulations of future climate contain variability arising from a number of sources, including internal stochasticity and external forcings. However, climate models and the true observed climate depend, to the best of our understanding, on the same underlying physical processes. In this paper, we simultaneously study the outputs of multiple climate simulation models and observed data, and we seek to leverage their mean structure as well as interdependencies that may reflect the climate's response to shared forcings. Bayesian modeling provides fruitful ground for the nuanced combination of multiple climate simulations. We introduce one such approach, whereby a Gaussian process is used to represent a mean function common to all simulated and observed climates. Dependent random effects encode possible information contained within and between the plurality of climate model outputs and observed climate data. We propose an empirical Bayes approach to analyze such models in a computationally efficient way. This methodology is amenable to the CMIP6 model ensemble, and we demonstrate its efficacy at forecasting global average near-surface air temperature. The results suggest that this model and the extensions it engenders may provide value for climate prediction and uncertainty quantification. (A toy pooled-GP sketch follows this list.)
  7. With the advent of big data and the popularity of black-box deep learning methods, it is imperative to address the robustness of neural networks to noise and outliers. We propose the use of Winsorization to recover model performance when the data may contain outliers and other aberrant observations. We provide a comparative analysis of several probabilistic artificial intelligence and machine learning techniques in supervised learning case studies. Broadly, Winsorization is a versatile technique for accounting for outliers in data; however, different probabilistic machine learning techniques have different levels of efficiency when used on outlier-prone data, with or without Winsorization. We find that Gaussian processes are extremely vulnerable to outliers, while deep learning techniques are in general more robust. (A minimal Winsorization example follows this list.)
  8. We present an overview of four challenging research areas in multiscale physics and engineering, as well as four data science topics that may be developed to address these challenges. We focus on multiscale spatiotemporal problems in light of the importance of understanding the accompanying scientific processes and engineering ideas, where "multiscale" refers to concurrent, non-trivial and coupled models over scales separated by orders of magnitude in space, time, energy, momenta, or any other relevant parameter. Specifically, we consider problems where the data may be obtained at various resolutions; analyzing such data and constructing coupled models leads to open research questions in various applications of data science. For illustration, numerical studies are reported for one of the data science techniques discussed here, namely approximate Bayesian computation. (A minimal ABC rejection sampler is sketched after this list.)
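The sketches below are illustrative only: each is referenced from the corresponding item above, uses synthetic data, and simplifies or substitutes parts of the published methods as noted. For item 1, this is a minimal Metropolis-within-Gibbs sampler for a basic Fay-Herriot model (direct estimates y_i ~ N(theta_i, D_i) with known D_i, linking model theta_i ~ N(mu, A)), not the AGFH extension with its Gaussian process prior; the prior on A and all data are assumptions made for the demonstration.

```python
# Minimal Metropolis-within-Gibbs sketch for a *basic* Fay-Herriot model
# (not the AGFH model). All data are simulated for demonstration only.
import numpy as np

rng = np.random.default_rng(0)
m = 20
D = rng.uniform(0.5, 2.0, m)               # known sampling variances
theta_true = rng.normal(1.0, 1.0, m)
y = rng.normal(theta_true, np.sqrt(D))     # direct survey estimates

def log_post_A(A, theta, mu):
    """Log posterior of the linking variance A (illustrative heavy-tailed prior)."""
    if A <= 0:
        return -np.inf
    ll = -0.5 * np.sum(np.log(A) + (theta - mu) ** 2 / A)
    return ll - np.log1p(A)

n_iter, mu, A = 5000, 0.0, 1.0
samples = np.empty((n_iter, 2))
for t in range(n_iter):
    # Gibbs update for each small-area mean theta_i (conjugate normal)
    prec = 1.0 / D + 1.0 / A
    theta = rng.normal((y / D + mu / A) / prec, np.sqrt(1.0 / prec))
    # Gibbs update for the linking-model mean mu (flat prior)
    mu = rng.normal(theta.mean(), np.sqrt(A / m))
    # Metropolis random-walk update for A on the log scale
    A_prop = A * np.exp(0.3 * rng.normal())
    log_ratio = (log_post_A(A_prop, theta, mu) - log_post_A(A, theta, mu)
                 + np.log(A_prop / A))      # Jacobian for the log-scale proposal
    if np.log(rng.uniform()) < log_ratio:
        A = A_prop
    samples[t] = mu, A

print("posterior means (mu, A):", samples[2500:].mean(axis=0))
```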
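For item 2, a generic sketch of the two-step "classify, then regress" idea using scikit-learn; the synthetic features, the k-means clustering with three groups, and the gradient-boosting regressor are placeholder choices, not the paper's materials embeddings or models.

```python
# Two-step sketch: unsupervised grouping of materials, then one regressor per group.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
X = rng.normal(size=(476, 8))                       # stand-in materials embeddings
y = 900 + 400 * X[:, 0] - 250 * X[:, 1] + 60 * rng.normal(size=476)  # fake melting points (K)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: unsupervised grouping (e.g. a proxy for different melting/bonding types).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_tr)

# Step 2: one melting-point regressor per cluster.
models = {}
for c in range(kmeans.n_clusters):
    mask = kmeans.labels_ == c
    models[c] = GradientBoostingRegressor(random_state=0).fit(X_tr[mask], y_tr[mask])

# Route each test material to its cluster's regressor.
labels_te = kmeans.predict(X_te)
y_pred = np.array([models[c].predict(x[None, :])[0] for c, x in zip(labels_te, X_te)])
print("MAE (K):", mean_absolute_error(y_te, y_pred))
```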
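For item 3, a generic resampling-based variable-screening sketch: a toy genotype matrix, a least-squares fit, and a nonparametric bootstrap whose percentile intervals flag stably nonzero effects. This is not the e-values procedure, it ignores familial dependence, and, unlike the paper's method, it refits the model on each resample.

```python
# Generic bootstrap screening of SNP effects on a toy multi-marker model.
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 30
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)   # toy genotype matrix (0/1/2)
beta = np.zeros(p)
beta[[3, 7, 11]] = [0.6, -0.5, 0.4]                    # three truly associated SNPs
y = X @ beta + rng.normal(size=n)

coef_hat, *_ = np.linalg.lstsq(X, y, rcond=None)       # full-data fit

B = 1000
boot = np.empty((B, p))
for b in range(B):
    idx = rng.integers(0, n, n)                        # nonparametric bootstrap resample
    boot[b], *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)

lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
selected = np.where((lo > 0) | (hi < 0))[0]            # keep SNPs whose interval excludes zero
print("selected SNP indices:", selected)
```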
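For item 4, a small sketch of two uncertainty-quality metrics of the kind mentioned there: empirical coverage of 95% Gaussian predictive intervals and a simple dispersion measure for the predictive standard deviations. The predictions, uncertainties, and targets are synthetic stand-ins, and the exact definitions used in the paper may differ.

```python
# Coverage rate and dispersion of predictive uncertainty on synthetic data.
import numpy as np

rng = np.random.default_rng(3)
y_true = rng.normal(size=200)
y_pred = y_true + rng.normal(scale=0.3, size=200)    # stand-in model predictions
sigma = rng.uniform(0.25, 0.45, size=200)            # stand-in predictive std. devs.

z = 1.96                                             # 95% Gaussian interval half-width
coverage = (np.abs(y_true - y_pred) <= z * sigma).mean()
dispersion = sigma.std() / sigma.mean()              # coefficient of variation of the spread
print(f"coverage rate: {coverage:.2%}, dispersion of uncertainty: {dispersion:.3f}")
```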
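For item 5, the paper's specific uncertainty-based loss regularization is not reproduced here; shown instead is a common uncertainty-weighted loss (the heteroscedastic Gaussian negative log-likelihood), which is one standard way to let predicted uncertainty modulate a reconstruction loss during training.

```python
# Uncertainty-weighted (heteroscedastic Gaussian NLL) loss on toy reconstructions.
import numpy as np

def gaussian_nll(y_true, y_pred, log_var):
    """Per-sample NLL: errors are down-weighted where predicted variance is high,
    while high variance itself is penalized by the log-variance term."""
    inv_var = np.exp(-log_var)
    return 0.5 * (log_var + inv_var * (y_true - y_pred) ** 2)

rng = np.random.default_rng(4)
y_true = rng.normal(size=100)                        # toy basin characteristic
y_pred = y_true + rng.normal(scale=0.2, size=100)    # toy reconstruction
log_var = np.log(np.full(100, 0.2 ** 2))             # toy per-sample predicted variance

loss = gaussian_nll(y_true, y_pred, log_var).mean()
print("mean uncertainty-weighted loss:", round(float(loss), 4))
```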
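For item 6, a toy sketch that fits a single Gaussian process mean function to pooled output from several synthetic "simulators"; the dependent random effects linking simulators and observations, and the paper's empirical Bayes fitting scheme, are omitted.

```python
# Pooled-GP sketch: one common mean function over several fake simulation runs.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(5)
years = np.arange(1950, 2021)
truth = 0.012 * (years - 1950)                                   # synthetic warming trend (deg C)
ensemble = [truth + rng.normal(0.0, 0.08, years.size) for _ in range(5)]  # 5 fake simulators

X = np.tile(years, 5).reshape(-1, 1).astype(float)               # pooled inputs
y = np.concatenate(ensemble)                                     # pooled outputs

kernel = RBF(length_scale=20.0) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

mean, sd = gp.predict(np.array([[2030.0]]), return_std=True)
print(f"2030 anomaly forecast: {mean[0]:.2f} +/- {2 * sd[0]:.2f} deg C")
```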
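For item 7, a minimal Winsorization example using scipy: extreme values are clipped to chosen percentiles, which limits the influence of outliers before any model is fit. The data and the 5% limits are arbitrary choices for illustration.

```python
# Winsorization: clip the most extreme values to percentile cut-offs.
import numpy as np
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(6)
x = rng.normal(size=1000)
x[:10] = rng.normal(loc=0.0, scale=50.0, size=10)    # inject gross outliers

x_w = winsorize(x, limits=(0.05, 0.05))              # clip lowest/highest 5% of values
print("raw std:", round(float(x.std()), 2), " winsorized std:", round(float(x_w.std()), 2))
```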
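For item 8, a minimal approximate Bayesian computation (ABC) rejection sampler on a toy problem: inferring a Gaussian mean from its sample mean without ever evaluating a likelihood. The prior, tolerance, and summary statistic are illustrative choices rather than anything from the paper's numerical studies.

```python
# ABC rejection sampling: keep prior draws whose simulated summary is close to the observed one.
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=2.0, scale=1.0, size=100)
obs_stat = data.mean()                               # observed summary statistic

accepted = []
eps = 0.05                                           # tolerance on the summary distance
for _ in range(50_000):
    mu = rng.uniform(-5, 5)                          # draw parameter from the prior
    sim = rng.normal(loc=mu, scale=1.0, size=100)    # simulate data under mu
    if abs(sim.mean() - obs_stat) < eps:             # accept if summaries are close
        accepted.append(mu)

accepted = np.array(accepted)
print(f"ABC posterior mean {accepted.mean():.2f} from {accepted.size} accepted draws")
```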