We study the Bayesian multi-task variable selection problem, where the goal is to select activated variables for multiple related data sets simultaneously. We propose a new variational Bayes algorithm that generalizes and improves the recently developed "sum of single effects" model of Wang et al. (2020a). Motivated by differential gene network analysis in biology, we further extend our method to joint structure learning of multiple directed acyclic graphical models, a problem known to be computationally highly challenging. We propose a novel order MCMC sampler in which our multi-task variable selection algorithm is used to quickly evaluate the posterior probability of each ordering. Simulation studies and a real gene expression data analysis demonstrate the efficiency of our method. Finally, we prove a posterior consistency result for multi-task variable selection, which provides a theoretical guarantee for the proposed algorithms. Supplementary materials for this article are available online.
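As a rough illustration of the kind of update at the heart of "sum of single effects" models, the sketch below implements one single-effect regression (SER) step: it computes a Bayes factor for each variable under a single-effect prior and converts them to posterior inclusion probabilities. This is a minimal, self-contained approximation, not the authors' multi-task algorithm; the prior variance `sigma0_sq` and residual variance `sigma_sq` are assumed known for simplicity.

```python
import numpy as np

def single_effect_pips(X, y, sigma_sq=1.0, sigma0_sq=1.0):
    """One single-effect regression (SER) step: posterior inclusion
    probabilities under the assumption that exactly one variable has a
    nonzero effect. Variances are treated as known for simplicity."""
    xtx = np.sum(X ** 2, axis=0)              # X_j' X_j for each column
    betahat = (X.T @ y) / xtx                 # per-variable OLS estimates
    s_sq = sigma_sq / xtx                     # sampling variance of betahat_j
    # Wakefield-style log Bayes factor of "effect at j" vs. the null
    log_bf = 0.5 * np.log(s_sq / (s_sq + sigma0_sq)) \
        + 0.5 * (betahat ** 2 / s_sq) * (sigma0_sq / (s_sq + sigma0_sq))
    log_w = log_bf - log_bf.max()             # uniform prior over variables
    alpha = np.exp(log_w) / np.exp(log_w).sum()
    return alpha                              # posterior inclusion probabilities

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = 2.0 * X[:, 3] + rng.standard_normal(200)
print(single_effect_pips(X, y).argmax())      # expected: 3
```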
A Variable Selection Method for Improving Variable Selection Consistency and Soft Sensor Performance
In the last few decades, various spectroscopic soft sensors that predict a sample's properties from its spectroscopic readings have been reported. To improve prediction performance, variable selection aimed at eliminating irrelevant wavelengths is often performed before soft sensor model building. However, due to the data-driven nature of many variable selection methods, they can be sensitive to the choice of training data, and oftentimes the selected wavelengths show little connection to the underlying chemical bonds or functional groups that determine the property of the sample. To address these limitations, we propose a new variable selection method, namely consistency enhanced evolution for variable selection (CEEVS), which focuses on identifying variables that are consistently selected across different training datasets. To demonstrate the effectiveness and robustness of CEEVS, we compare it with three representative variable selection methods on two published NIR datasets. We show that by identifying variables with high selection consistency, CEEVS not only achieves improved soft sensor performance but also identifies key chemical information from spectroscopic data.
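CEEVS itself is not reproduced here, but the consistency idea it builds on can be illustrated with a simple stability-selection-style sketch: repeatedly refit a sparse model on resampled training sets and keep only the variables whose selection frequency clears a threshold. The lasso, the subsampling scheme, and the 0.8 threshold below are illustrative assumptions, not details of CEEVS.

```python
import numpy as np
from sklearn.linear_model import Lasso

def consistent_selection(X, y, n_resamples=50, alpha=0.05, threshold=0.8,
                         seed=0):
    """Keep variables selected in at least `threshold` of the resampled
    fits -- a stability-selection-style proxy for the 'selection
    consistency' idea, not the CEEVS algorithm itself."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_resamples):
        idx = rng.choice(n, size=n // 2, replace=False)   # subsample rows
        model = Lasso(alpha=alpha).fit(X[idx], y[idx])
        counts += model.coef_ != 0                        # record selections
    freq = counts / n_resamples
    return np.flatnonzero(freq >= threshold), freq

rng = np.random.default_rng(1)
X = rng.standard_normal((120, 30))
y = X[:, 2] - 1.5 * X[:, 7] + 0.3 * rng.standard_normal(120)
selected, freq = consistent_selection(X, y)
print(selected)  # expected to contain columns 2 and 7
```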
- Award ID(s): 1805950
- PAR ID: 10288563
- Date Published:
- Journal Name: 2020 American Control Conference (ACC)
- Page Range / eLocation ID: 725 to 730
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
International large-scale assessments (ILSAs) play an important role in educational research and policy making. They collect valuable data on education quality and performance development across many education systems, giving countries the opportunity to share techniques, organisational structures, and policies that have proven efficient and successful. To gain insights from ILSA data, we identify non-cognitive variables associated with students' academic performance. This problem poses three analytical challenges: (a) academic performance is measured by cognitive items under a matrix sampling design; (b) there are many missing values in the non-cognitive variables; and (c) multiple comparisons arise from the large number of non-cognitive variables. We consider an application to the Programme for International Student Assessment, aiming to identify non-cognitive variables associated with students' performance in science. We formulate it as a variable selection problem under a general latent variable model framework and propose a knockoff method that conducts variable selection with a controlled error rate for false selections.
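The paper's knockoff construction under a latent variable model is specific to its setting, but the generic knockoff filter step is easy to sketch: given an importance statistic for each real variable and for its knockoff copy, form W_j as their difference and select variables above the data-driven threshold that controls the false discovery rate. Building valid knockoff copies is the hard part and is assumed already done here; the toy statistics below are made up for illustration.

```python
import numpy as np

def knockoff_select(Z, Z_tilde, q=0.2):
    """Generic knockoff+ filter step. Z[j] and Z_tilde[j] are importance
    statistics for variable j and its knockoff copy (assumed already
    constructed). Selects variables at target FDR level q."""
    W = np.abs(Z) - np.abs(Z_tilde)           # signed evidence per variable
    ts = np.sort(np.abs(W[W != 0]))           # candidate thresholds, ascending
    for t in ts:
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.flatnonzero(W >= t)     # selected variable indices
    return np.array([], dtype=int)            # nothing clears the threshold

# Toy check: strong signal on the first ten variables only.
rng = np.random.default_rng(2)
Z = np.concatenate([np.full(10, 6.0), rng.normal(0, 1, 50)])
Z_tilde = rng.normal(0, 1, 60)
print(knockoff_select(Z, Z_tilde))  # expected: mostly indices 0-9
```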
-
Forward selection (FS) is a popular variable selection method for linear regression. But theoretical understanding of FS with a diverging number of covariates is still limited. We derive sufficient conditions for FS to attain model selection consistency. Our conditions are similar to those for orthogonal matching pursuit, but are obtained using a different argument. When the true model size is unknown, we derive sufficient conditions for model selection consistency of FS with a data-driven stopping rule, based on a sequential variant of cross-validation. As a byproduct of our proofs, we also obtain a sharp (sufficient and almost necessary) condition for model selection consistency of "wrapper" forward search for linear regression. We illustrate the intuition and demonstrate the performance of our methods using simulation studies and real datasets.
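A minimal sketch of forward selection with a cross-validation-based stopping rule, in the spirit of the data-driven rule described above. The paper's sequential CV variant is not reproduced; plain K-fold CV error on the current model is used as a stand-in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, cv=5):
    """Greedy forward selection: at each step add the covariate that most
    improves K-fold CV error, and stop when no addition helps. A simple
    stand-in for the paper's sequential CV stopping rule."""
    p = X.shape[1]
    active, best_err = [], np.inf
    while len(active) < p:
        errs = {}
        for j in set(range(p)) - set(active):
            scores = cross_val_score(LinearRegression(), X[:, active + [j]],
                                     y, cv=cv,
                                     scoring="neg_mean_squared_error")
            errs[j] = -scores.mean()
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best_err:          # stopping rule: no improvement
            break
        best_err = errs[j_best]
        active.append(j_best)
    return active

rng = np.random.default_rng(3)
X = rng.standard_normal((150, 20))
y = 3 * X[:, 0] - 2 * X[:, 5] + rng.standard_normal(150)
print(forward_select(X, y))  # expected to pick columns 0 and 5 first
```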
-
Bayesian cross-validation (CV) is a popular method for predictive model assessment that is simple to implement and broadly applicable. A wide range of CV schemes is available for time series applications, including generic leave-one-out (LOO) and K-fold methods, as well as specialized approaches intended to deal with serial dependence, such as leave-future-out (LFO), h-block, and hv-block. Existing large-sample results show that both specialized and generic methods are applicable to models of serially dependent data. However, large-sample consistency results overlook the impact of sampling variability on accuracy in finite samples. Moreover, the accuracy of a CV scheme depends on many aspects of the procedure, and we show that poor design choices can lead to elevated rates of adverse selection. In this paper, we consider the problem of identifying the regression component of an important class of models of data with serial dependence, autoregressions of order p with q exogenous regressors (ARX(p,q)), under the logarithmic scoring rule. We show that when serial dependence is present, scores computed using the joint (multivariate) density have lower variance and better model selection accuracy than the popular pointwise estimator. In addition, we present a detailed case study of the special case of ARX models with fixed autoregressive structure and variance. For this class, we derive the finite-sample distribution of the CV estimators and the model selection statistic. We conclude with recommendations for practitioners.
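To make the LFO scheme mentioned above concrete, here is a sketch of leave-future-out CV for a plain AR(p) model: at each step the model is refit by least squares on the observed past only, and the next point is scored under a Gaussian one-step-ahead predictive density. This is a plug-in approximation, not the fully Bayesian estimator the paper studies, and the ARX exogenous regressors are omitted for brevity.

```python
import numpy as np
from scipy.stats import norm

def ar_design(y, p):
    """Design rows [1, y_{t-1}, ..., y_{t-p}] for t = p, ..., len(y)-1."""
    T = len(y)
    X = np.column_stack([np.ones(T - p)] +
                        [y[p - k: T - k] for k in range(1, p + 1)])
    return X, y[p:]

def lfo_logscore(y, p=1, n_init=50):
    """Leave-future-out CV: refit an AR(p) model by OLS on y[:t] only,
    then score the one-step-ahead Gaussian log predictive density of
    y[t]. A plug-in sketch of LFO, not the Bayesian procedure."""
    total = 0.0
    for t in range(n_init, len(y)):
        X, z = ar_design(y[:t], p)
        beta, *_ = np.linalg.lstsq(X, z, rcond=None)
        sigma = np.sqrt(np.mean((z - X @ beta) ** 2))
        x_new = np.concatenate(([1.0], y[t - 1: t - p - 1: -1]))
        total += norm.logpdf(y[t], loc=x_new @ beta, scale=sigma)
    return total

rng = np.random.default_rng(4)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.7 * y[t - 1] + rng.standard_normal()
print(lfo_logscore(y, p=1))   # sum of one-step-ahead log scores
```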
-
Unsupervised feature selection aims to select a subset of the original features that is most useful for downstream tasks without external guidance information. Most unsupervised feature selection methods rank features based on the intrinsic properties of the data but pay little attention to the relationships between features, which often leads to redundancy among the selected features. In this paper, we propose a two-stage Second-Order unsupervised Feature selection via knowledge contrastive disTillation (SOFT) model that incorporates the second-order covariance matrix with the first-order data matrix for unsupervised feature selection. In the first stage, we learn a sparse attention matrix that represents second-order relations between features by contrastively distilling the intrinsic structure. In the second stage, we build a relational graph based on the learned attention matrix and perform graph segmentation; feature selection then keeps only one feature from each cluster to reduce feature redundancy. Experimental results on 12 public datasets show that SOFT outperforms classical and recent state-of-the-art methods, demonstrating the effectiveness of our proposed method. Moreover, we provide rich in-depth experiments to further explore several key factors of SOFT.
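SOFT's attention-learning stage is not reproduced here, but its final step, keeping one feature per cluster of a feature-relation graph to cut redundancy, can be sketched with ordinary correlation-based hierarchical clustering. The distance, linkage, and variance-based representative choice below are illustrative assumptions, not SOFT's actual construction.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def one_per_cluster(X, n_select=5):
    """Cluster features by absolute correlation, then keep the
    highest-variance feature from each cluster -- a simple stand-in for
    SOFT's 'one feature per cluster' redundancy-reduction step."""
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)                 # dissimilarity between features
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=n_select, criterion="maxclust")
    variances = X.var(axis=0)
    selected = [max(np.flatnonzero(labels == c), key=lambda j: variances[j])
                for c in np.unique(labels)]
    return sorted(selected)

rng = np.random.default_rng(5)
base = rng.standard_normal((200, 5))
X = np.repeat(base, 3, axis=1) + 0.1 * rng.standard_normal((200, 15))
print(one_per_cluster(X, n_select=5))  # one column per correlated triple
```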