NSF PAR Search | NSF Public Access Repository

Simultaneous selection of multiple important single nucleotide polymorphisms in familial genome wide association studies data

https://doi.org/10.1038/s41598-023-35379-y

Majumdar, Subhabrata; Basu, Saonli; McGue, Matt; Chatterjee, Snigdhansu (December 2023, Scientific Reports)

Abstract We propose a resampling-based fast variable selection technique for detecting relevant single nucleotide polymorphisms (SNPs) in a multi-marker mixed effect model. Due to computational complexity, current practice primarily involves testing the effect of one SNP at a time, commonly termed ‘single SNP association analysis’. Joint modeling of genetic variants within a gene or pathway may have better power to detect associated genetic variants, especially those with weak effects. In this paper, we propose a computationally efficient model selection approach, based on the e-values framework, for single SNP detection in families while utilizing information on multiple SNPs simultaneously. To overcome the computational bottleneck of traditional model selection methods, our method trains a single model and uses a fast and scalable bootstrap procedure. We illustrate through numerical studies that our proposed method is more effective in detecting SNPs associated with a trait than either single-marker analysis using family data or model selection methods that ignore the familial dependency structure. Further, we perform gene-level analysis of the Minnesota Center for Twin and Family Research (MCTFR) dataset using our method and detect several SNPs that have been implicated in alcohol consumption.
Full Text Available
High dimensional, robust, unsupervised record linkage

https://doi.org/10.21307/stattrans-2020-034

Bera, Sabyasachi; Chatterjee, Snigdhansu (August 2020, Statistics in Transition New Series)
Okrasa, Włodzimierz; Lahiri, Partha (Ed.)
Abstract We develop a technique for record linkage on high dimensional data, where the two datasets may not have any common variable, and there may be no training set available. Our methodology is based on sparse, high dimensional principal components. Since large and high dimensional datasets are often prone to outliers and aberrant observations, we propose a technique for estimating robust, high dimensional principal components. We present theoretical results validating the robust, high dimensional principal component estimation steps, and justifying their use for record linkage. Some numeric results and remarks are also presented.
Full Text Available
A Bayesian framework for studying climate anomalies and social conflicts

https://doi.org/10.1002/env.2778

Mukherjee, Ujjal Kumar; Bagozzi, Benjamin E.; Chatterjee, Snigdhansu (March 2023, Environmetrics)

Full Text Available
Exploring approaches for predictive cancer patient digital twins: Opportunities for collaboration and innovation

https://doi.org/10.3389/fdgth.2022.1007784

Stahlberg, Eric A.; Abdel-Rahman, Mohamed; Aguilar, Boris; Asadpoure, Alireza; Beckman, Robert A.; Borkon, Lynn L.; Bryan, Jeffrey N.; Cebulla, Colleen M.; Chang, Young Hwan; Chatterjee, Ansu; et al (October 2022, Frontiers in Digital Health)

We are rapidly approaching a future in which cancer patient digital twins will reach their potential to predict cancer prevention, diagnosis, and treatment in individual patients. This will be realized based on advances in high performance computing, computational modeling, and an expanding repertoire of observational data across multiple scales and modalities. In 2020, the US National Cancer Institute and the US Department of Energy, through a trans-disciplinary research community at the intersection of advanced computing and cancer research, initiated team science collaborative projects to explore the development and implementation of predictive Cancer Patient Digital Twins (CPDTs). Several diverse pilot projects were launched to provide key insights into important features of this emerging landscape and to determine the requirements for the development and adoption of cancer patient digital twins. Projects included exploring approaches to using a large cohort of digital twins to perform deep phenotyping and plan treatments at the individual level, prototyping self-learning digital twin platforms, using adaptive digital twin approaches to monitor treatment response and resistance, developing methods to integrate and fuse data and observations across multiple scales, and personalizing treatment based on cancer type. Collectively, these efforts have yielded increased insights into the opportunities and challenges facing cancer patient digital twin approaches and helped define a path forward. Given the rapidly growing interest in patient digital twins, this manuscript provides a valuable early progress report on several CPDT pilot projects commenced in common, their overall aims, early progress, lessons learned, and future directions that will increasingly involve the broader research community.
Full Text Available
On weighted multivariate sign functions

https://doi.org/10.1016/j.jmva.2022.105013

Majumdar, Subhabrata; Chatterjee, Snigdhansu (September 2022, Journal of Multivariate Analysis)

Full Text Available
Feature Selection using e-values

Majumdar, S; Chatterjee, S (July 2022, Proceedings of the 39th International Conference on Machine Learning)

In the context of supervised parametric models, we introduce the concept of e-values. An e-value is a scalar quantity that represents the proximity of the sampling distribution of parameter estimates in a model trained on a subset of features to that of the model trained on all features (i.e. the full model). Under general conditions, a rank ordering of e-values separates models that contain all essential features from those that do not. The e-values are applicable to a wide range of parametric models. We use data depths and a fast resampling-based algorithm to implement a feature selection procedure using e-values, providing consistency results. For a p-dimensional feature space, this procedure requires fitting only the full model and evaluating p + 1 models, as opposed to the traditional requirement of fitting and evaluating 2^p models. Through experiments across several model settings and synthetic and real datasets, we establish the e-values method as a promising general alternative to existing model-specific methods of feature selection.
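As a rough illustration of the idea summarized in the abstract above, the following Python sketch scores the full model and each of the p drop-one models by the mean depth of their bootstrap coefficient draws, so only p + 1 models are evaluated. It is a simplified stand-in, not the authors' algorithm: it assumes ordinary least squares, a plain case-resampling bootstrap that refits the model on each resample (the abstract describes a faster resampling scheme that fits only the full model), and Mahalanobis depth as the data depth. The function names `mahalanobis_depth` and `select_features`, and the rule of keeping a feature when zeroing its coefficient lowers the score below the full model's, are assumptions made for this example.

```python
import numpy as np


def mahalanobis_depth(points, reference):
    """Mahalanobis depth of each row of `points` relative to the sample `reference`."""
    center = reference.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(reference, rowvar=False))
    diff = points - center
    dist_sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return 1.0 / (1.0 + dist_sq)


def select_features(X, y, n_boot=200, seed=0):
    """Score the full model and the p drop-one models; return the selected features.

    Assumption: OLS with a case-resampling bootstrap stands in for the
    fast resampling scheme mentioned in the abstract.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Bootstrap draws approximating the sampling distribution of the
    # full-model coefficient estimate.
    boot = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        boot[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]

    # Score of the full model: mean depth of its own draws.
    full_score = mahalanobis_depth(boot, boot).mean()

    # Score of each drop-one model: zero out coefficient j in every draw
    # and measure how deep the result sits in the full-model distribution.
    drop_scores = np.empty(p)
    for j in range(p):
        reduced = boot.copy()
        reduced[:, j] = 0.0
        drop_scores[j] = mahalanobis_depth(reduced, boot).mean()

    # Keep feature j if removing it pushes the score below the full model's.
    selected = [j for j in range(p) if drop_scores[j] < full_score]
    return selected, full_score, drop_scores
```

Calling `select_features(X, y)` on a design matrix `X` of shape (n, p) and a response `y` returns the indices of the retained features along with the p + 1 scores. Features with strong effects are easy to detect in this sketch because setting their coefficient to zero moves the draws far from the center of the full-model distribution.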
Full Text Available
Quantifying Spatial and Temporal Relationships Among Tree-Ring Records

Heyman, M; St. George, S; Chatterjee, S (July 2020, Statistics and applications)

Tree growth rings contain yearly information about climate, extreme weather events, and other growing conditions. In this analysis, we model the relationship strength between tree-ring records with respect to location and time. We employ the discrete wavelet transformation on the ring width records in order to de-correlate the observations within each series while simultaneously retrieving time-scale information. Our model then describes correlations among the resulting wavelet coefficients at different temporal scales by distance. Statistical inference through a new version of the wild bootstrap indicates that the relationship strength decreases linearly as record pair distance increases, but the slopes differ across temporal scales.
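To make the wavelet step in the abstract above concrete, here is a minimal Python sketch that decomposes two series with a discrete wavelet transform and correlates their detail coefficients scale by scale. It assumes the Haar wavelet, which the abstract does not specify, and the function names `haar_dwt` and `scale_correlations` are hypothetical. It does not attempt the distance model or the wild bootstrap inference the abstract mentions.

```python
import numpy as np


def haar_dwt(series, levels):
    """Orthonormal Haar DWT: return the final approximation and per-level details."""
    approx = np.asarray(series, dtype=float)
    details = []
    for _ in range(levels):
        if approx.size % 2 != 0:
            raise ValueError("series length must be divisible by 2**levels")
        even, odd = approx[0::2], approx[1::2]
        # Detail coefficients capture change at the current time scale.
        details.append((even - odd) / np.sqrt(2.0))
        # Approximation coefficients carry the smoothed series to the next scale.
        approx = (even + odd) / np.sqrt(2.0)
    return approx, details


def scale_correlations(series_a, series_b, levels):
    """Pearson correlation between the detail coefficients of two series at each scale."""
    _, details_a = haar_dwt(series_a, levels)
    _, details_b = haar_dwt(series_b, levels)
    return [float(np.corrcoef(a, b)[0, 1]) for a, b in zip(details_a, details_b)]
```

For a pair of ring-width records of equal length, `scale_correlations(record_a, record_b, levels=3)` returns one correlation per temporal scale, finest first. These are the kind of per-scale quantities that could then be related to the distance between record locations.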
Full Text Available
