Search for: All records

Creators/Authors contains: "Chatterjee, Snigdhansu"


  1. We present a new semiparametric extension of the Fay-Herriot model, termed the agnostic Fay-Herriot model (AGFH), in which the sampling-level model is expressed in terms of an unknown general function. Because the choice of this function is extremely broad, the AGFH model can express any distribution in the sampling model. We propose a Bayesian modelling scheme for AGFH in which the unknown function is assigned a Gaussian process prior. Using a Metropolis-within-Gibbs Markov chain Monte Carlo scheme, we study the performance of the AGFH model alongside that of a hierarchical Bayesian extension of the Fay-Herriot model. Our analysis shows that AGFH is an excellent modelling alternative when the sampling distribution is non-Normal, especially when the sampling distribution is bounded, and it is also the best choice when the sampling variance is high. However, the hierarchical Bayesian framework and the traditional empirical Bayesian framework can be good modelling alternatives when the signal-to-noise ratio is high and there are computational constraints. (A toy Metropolis-within-Gibbs sampler for a basic Fay-Herriot model is sketched after this list.) AMS subject classification: 62D05; 62F15
  2. The melting temperature is important for materials design because of its relationship with thermal stability, synthesis, and processing conditions. Current empirical and computational melting-point estimation techniques are limited in scope, computational feasibility, or interpretability. We report the development of a machine learning methodology for predicting the melting temperatures of binary ionic solid materials. We evaluated different machine learning models trained on a dataset of the melting points of 476 non-metallic crystalline binary compounds, using materials embeddings constructed from elemental properties and density functional theory calculations as model inputs. A direct supervised-learning approach yields a mean absolute error of around 180 K but suffers from low interpretability. We find that the fidelity of the predictions can be further improved by introducing an additional unsupervised-learning step that first classifies the materials before the melting-point regression. Not only does this two-step model exhibit improved accuracy, but it also provides a level of interpretability, with insights into feature importance and into the different types of melting that depend on the specific atomic bonding inside a material. Motivated by this finding, we used a symbolic learning approach to find interpretable physical models for the melting temperature, which recovered the best-performing features from both prior models and provided additional interpretability. (A generic cluster-then-regress sketch follows this list.)
    Free, publicly-accessible full text available May 28, 2025
  3. We propose a resampling-based fast variable selection technique for detecting relevant single nucleotide polymorphisms (SNPs) in a multi-marker mixed effect model. Due to computational complexity, current practice primarily involves testing the effect of one SNP at a time, commonly termed 'single-SNP association analysis'. Joint modeling of genetic variants within a gene or pathway may have better power to detect associated genetic variants, especially those with weak effects. In this paper, we propose a computationally efficient model selection approach, based on the e-values framework, for single-SNP detection in families while utilizing information on multiple SNPs simultaneously. To overcome the computational bottleneck of traditional model selection methods, our method trains a single model and utilizes a fast and scalable bootstrap procedure. We illustrate through numerical studies that our proposed method is more effective in detecting SNPs associated with a trait than either single-marker analysis using family data or model selection methods that ignore the familial dependency structure. Further, we perform a gene-level analysis of the Minnesota Center for Twin and Family Research (MCTFR) dataset using our method and detect several SNPs that have previously been implicated as associated with alcohol consumption. (A generic bootstrap-screening sketch follows this list.)
  4. In hydrology, modeling streamflow remains a challenging task due to the limited availability of information on basin characteristics such as soil geology and geomorphology. These characteristics may be noisy due to measurement errors or may be missing altogether. To overcome this challenge, we propose a knowledge-guided, probabilistic inverse modeling method for recovering physical characteristics from streamflow and weather data, which are more readily available. We compare our framework with state-of-the-art inverse models for estimating river basin characteristics. We also show that these estimates improve streamflow modeling compared with using the original basin characteristic values. Our framework offers a 3% improvement in R2 for the inverse model (basin characteristic estimation) and 6% for the forward model (streamflow prediction). It also offers improved explainability, since it can quantify uncertainty in both the inverse and the forward model. Uncertainty quantification plays a pivotal role in improving the explainability of machine learning models by providing additional insight into the reliability and limitations of model predictions. In our analysis, we assess the quality of the uncertainty estimates. Compared to baseline uncertainty quantification methods, our framework offers a 10% improvement in the dispersion of epistemic uncertainty and a 13% improvement in coverage rate. This information can help stakeholders understand the level of uncertainty associated with the predictions and provides a more comprehensive view of the potential outcomes. (A small sketch of coverage-rate and dispersion metrics follows this list.)
  5. Shekhar, Shashi; Zhou, Zhi-Hua; Chiang, Yao-Yi; Stiglic, Gregor (Ed.)
    Rapid advances in inverse modeling methods have brought to light their susceptibility to imperfect data. This has made it imperative to obtain more explainable and trustworthy estimates from these models. In hydrology, basin characteristics can be noisy or missing, impacting streamflow prediction. We propose a probabilistic inverse model framework that can reconstruct robust hydrology basin characteristics from dynamic weather-driver inputs and streamflow response data. We address two aspects of building more explainable inverse models: uncertainty estimation (uncertainty due to imperfect data and an imperfect model) and robustness. This can help improve the trust of water managers, improve the handling of noisy data, and reduce costs. We also propose an uncertainty-based loss regularization that removes 17% of the temporal artifacts in the reconstructions, reduces uncertainty by 36%, and yields a 4% higher coverage rate for basin characteristics. The forward model performance (streamflow estimation) is also improved by 6% when using the reconstructions learned with this uncertainty-based loss. (A sketch of a common uncertainty-weighted loss follows this list.)
  6. Simulations of future climate contain variability arising from a number of sources, including internal stochasticity and external forcings. However, climate models and the true observed climate depend, to the best of our understanding, on the same underlying physical processes. In this paper, we simultaneously study the outputs of multiple climate simulation models and observed data, and we seek to leverage their mean structure as well as interdependencies that may reflect the climate's response to shared forcings. Bayesian modeling provides fruitful ground for the nuanced combination of multiple climate simulations. We introduce one such approach, whereby a Gaussian process is used to represent a mean function common to all simulated and observed climates. Dependent random effects encode possible information contained within and between the plurality of climate model outputs and observed climate data. We propose an empirical Bayes approach to analyze such models in a computationally efficient way. This methodology is amenable to the CMIP6 model ensemble, and we demonstrate its efficacy at forecasting global average near-surface air temperature. The results suggest that this model and the extensions it engenders may provide value for climate prediction and uncertainty quantification. (A toy pooled-GP sketch follows this list.)
  7. With the advent of big data and the popularity of black-box deep learning methods, it is imperative to address the robustness of neural networks to noise and outliers. We propose the use of Winsorization to recover model performance when the data may contain outliers and other aberrant observations. We provide a comparative analysis of several probabilistic artificial intelligence and machine learning techniques in supervised learning case studies. Broadly, Winsorization is a versatile technique for accounting for outliers in data; however, different probabilistic machine learning techniques have different levels of efficiency when used on outlier-prone data, with or without Winsorization. We find that Gaussian processes are extremely vulnerable to outliers, while deep learning techniques are in general more robust. (A minimal Winsorization example follows this list.)
  8. We present an overview of four challenging research areas in multiscale physics and engineering, as well as four data science topics that may be developed to address these challenges. We focus on multiscale spatiotemporal problems in light of the importance of understanding the accompanying scientific processes and engineering ideas, where "multiscale" refers to concurrent, non-trivial and coupled models over scales separated by orders of magnitude in space, time, energy, momenta, or any other relevant parameter. Specifically, we consider problems where the data may be obtained at various resolutions; analyzing such data and constructing coupled models leads to open research questions in various applications of data science. For illustration, numerical studies are reported for one of the data science techniques discussed here, namely approximate Bayesian computation. (A minimal ABC rejection sampler is sketched after this list.)
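The sketches below are illustrative only: each is referenced from the corresponding item above, uses synthetic data, and simplifies or substitutes parts of the published methods as noted. For item 1, this is a minimal Metropolis-within-Gibbs sampler for a basic Fay-Herriot model (direct estimates y_i ~ N(theta_i, D_i) with known D_i, linking model theta_i ~ N(mu, A)), not the AGFH extension with its Gaussian process prior; the prior on A and all data are assumptions made for the demonstration.

```python
# Minimal Metropolis-within-Gibbs sketch for a *basic* Fay-Herriot model
# (not the AGFH model). All data are simulated for demonstration only.
import numpy as np

rng = np.random.default_rng(0)
m = 20
D = rng.uniform(0.5, 2.0, m)               # known sampling variances
theta_true = rng.normal(1.0, 1.0, m)
y = rng.normal(theta_true, np.sqrt(D))     # direct survey estimates

def log_post_A(A, theta, mu):
    """Log posterior of the linking variance A (illustrative heavy-tailed prior)."""
    if A <= 0:
        return -np.inf
    ll = -0.5 * np.sum(np.log(A) + (theta - mu) ** 2 / A)
    return ll - np.log1p(A)

n_iter, mu, A = 5000, 0.0, 1.0
samples = np.empty((n_iter, 2))
for t in range(n_iter):
    # Gibbs update for each small-area mean theta_i (conjugate normal)
    prec = 1.0 / D + 1.0 / A
    theta = rng.normal((y / D + mu / A) / prec, np.sqrt(1.0 / prec))
    # Gibbs update for the linking-model mean mu (flat prior)
    mu = rng.normal(theta.mean(), np.sqrt(A / m))
    # Metropolis random-walk update for A on the log scale
    A_prop = A * np.exp(0.3 * rng.normal())
    log_ratio = (log_post_A(A_prop, theta, mu) - log_post_A(A, theta, mu)
                 + np.log(A_prop / A))      # Jacobian for the log-scale proposal
    if np.log(rng.uniform()) < log_ratio:
        A = A_prop
    samples[t] = mu, A

print("posterior means (mu, A):", samples[2500:].mean(axis=0))
```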
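For item 2, a generic sketch of the two-step "classify, then regress" idea using scikit-learn; the synthetic features, the k-means clustering with three groups, and the gradient-boosting regressor are placeholder choices, not the paper's materials embeddings or models.

```python
# Two-step sketch: unsupervised grouping of materials, then one regressor per group.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
X = rng.normal(size=(476, 8))                       # stand-in materials embeddings
y = 900 + 400 * X[:, 0] - 250 * X[:, 1] + 60 * rng.normal(size=476)  # fake melting points (K)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: unsupervised grouping (e.g. a proxy for different melting/bonding types).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_tr)

# Step 2: one melting-point regressor per cluster.
models = {}
for c in range(kmeans.n_clusters):
    mask = kmeans.labels_ == c
    models[c] = GradientBoostingRegressor(random_state=0).fit(X_tr[mask], y_tr[mask])

# Route each test material to its cluster's regressor.
labels_te = kmeans.predict(X_te)
y_pred = np.array([models[c].predict(x[None, :])[0] for c, x in zip(labels_te, X_te)])
print("MAE (K):", mean_absolute_error(y_te, y_pred))
```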
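For item 3, a generic resampling-based variable-screening sketch: a toy genotype matrix, a least-squares fit, and a nonparametric bootstrap whose percentile intervals flag stably nonzero effects. This is not the e-values procedure, it ignores familial dependence, and, unlike the paper's method, it refits the model on each resample.

```python
# Generic bootstrap screening of SNP effects on a toy multi-marker model.
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 30
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)   # toy genotype matrix (0/1/2)
beta = np.zeros(p)
beta[[3, 7, 11]] = [0.6, -0.5, 0.4]                    # three truly associated SNPs
y = X @ beta + rng.normal(size=n)

coef_hat, *_ = np.linalg.lstsq(X, y, rcond=None)       # full-data fit

B = 1000
boot = np.empty((B, p))
for b in range(B):
    idx = rng.integers(0, n, n)                        # nonparametric bootstrap resample
    boot[b], *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)

lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
selected = np.where((lo > 0) | (hi < 0))[0]            # keep SNPs whose interval excludes zero
print("selected SNP indices:", selected)
```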
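For item 4, a small sketch of two uncertainty-quality metrics of the kind mentioned there: empirical coverage of 95% Gaussian predictive intervals and a simple dispersion measure for the predictive standard deviations. The predictions, uncertainties, and targets are synthetic stand-ins, and the exact definitions used in the paper may differ.

```python
# Coverage rate and dispersion of predictive uncertainty on synthetic data.
import numpy as np

rng = np.random.default_rng(3)
y_true = rng.normal(size=200)
y_pred = y_true + rng.normal(scale=0.3, size=200)    # stand-in model predictions
sigma = rng.uniform(0.25, 0.45, size=200)            # stand-in predictive std. devs.

z = 1.96                                             # 95% Gaussian interval half-width
coverage = (np.abs(y_true - y_pred) <= z * sigma).mean()
dispersion = sigma.std() / sigma.mean()              # coefficient of variation of the spread
print(f"coverage rate: {coverage:.2%}, dispersion of uncertainty: {dispersion:.3f}")
```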
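For item 5, the paper's specific uncertainty-based loss regularization is not reproduced here; shown instead is a common uncertainty-weighted loss (the heteroscedastic Gaussian negative log-likelihood), which is one standard way to let predicted uncertainty modulate a reconstruction loss during training.

```python
# Uncertainty-weighted (heteroscedastic Gaussian NLL) loss on toy reconstructions.
import numpy as np

def gaussian_nll(y_true, y_pred, log_var):
    """Per-sample NLL: errors are down-weighted where predicted variance is high,
    while high variance itself is penalized by the log-variance term."""
    inv_var = np.exp(-log_var)
    return 0.5 * (log_var + inv_var * (y_true - y_pred) ** 2)

rng = np.random.default_rng(4)
y_true = rng.normal(size=100)                        # toy basin characteristic
y_pred = y_true + rng.normal(scale=0.2, size=100)    # toy reconstruction
log_var = np.log(np.full(100, 0.2 ** 2))             # toy per-sample predicted variance

loss = gaussian_nll(y_true, y_pred, log_var).mean()
print("mean uncertainty-weighted loss:", round(float(loss), 4))
```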
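For item 6, a toy sketch that fits a single Gaussian process mean function to pooled output from several synthetic "simulators"; the dependent random effects linking simulators and observations, and the paper's empirical Bayes fitting scheme, are omitted.

```python
# Pooled-GP sketch: one common mean function over several fake simulation runs.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(5)
years = np.arange(1950, 2021)
truth = 0.012 * (years - 1950)                                   # synthetic warming trend (deg C)
ensemble = [truth + rng.normal(0.0, 0.08, years.size) for _ in range(5)]  # 5 fake simulators

X = np.tile(years, 5).reshape(-1, 1).astype(float)               # pooled inputs
y = np.concatenate(ensemble)                                     # pooled outputs

kernel = RBF(length_scale=20.0) + WhiteKernel(noise_level=0.01)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

mean, sd = gp.predict(np.array([[2030.0]]), return_std=True)
print(f"2030 anomaly forecast: {mean[0]:.2f} +/- {2 * sd[0]:.2f} deg C")
```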
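For item 7, a minimal Winsorization example using scipy: extreme values are clipped to chosen percentiles, which limits the influence of outliers before any model is fit. The data and the 5% limits are arbitrary choices for illustration.

```python
# Winsorization: clip the most extreme values to percentile cut-offs.
import numpy as np
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(6)
x = rng.normal(size=1000)
x[:10] = rng.normal(loc=0.0, scale=50.0, size=10)    # inject gross outliers

x_w = winsorize(x, limits=(0.05, 0.05))              # clip lowest/highest 5% of values
print("raw std:", round(float(x.std()), 2), " winsorized std:", round(float(x_w.std()), 2))
```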
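For item 8, a minimal approximate Bayesian computation (ABC) rejection sampler on a toy problem: inferring a Gaussian mean from its sample mean without ever evaluating a likelihood. The prior, tolerance, and summary statistic are illustrative choices rather than anything from the paper's numerical studies.

```python
# ABC rejection sampling: keep prior draws whose simulated summary is close to the observed one.
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(loc=2.0, scale=1.0, size=100)
obs_stat = data.mean()                               # observed summary statistic

accepted = []
eps = 0.05                                           # tolerance on the summary distance
for _ in range(50_000):
    mu = rng.uniform(-5, 5)                          # draw parameter from the prior
    sim = rng.normal(loc=mu, scale=1.0, size=100)    # simulate data under mu
    if abs(sim.mean() - obs_stat) < eps:             # accept if summaries are close
        accepted.append(mu)

accepted = np.array(accepted)
print(f"ABC posterior mean {accepted.mean():.2f} from {accepted.size} accepted draws")
```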