Title: Variable selection in latent variable models via knockoffs: an application to international large-scale assessment in education
Abstract: International large-scale assessments (ILSAs) play an important role in educational research and policy making. They collect valuable data on education quality and performance development across many education systems, giving countries the opportunity to share techniques, organisational structures, and policies that have proven efficient and successful. To gain insights from ILSA data, we identify non-cognitive variables associated with students’ academic performance. This problem poses three analytical challenges: (a) academic performance is measured by cognitive items under a matrix sampling design; (b) there are many missing values in the non-cognitive variables; and (c) there are multiple comparisons due to the large number of non-cognitive variables. We consider an application to the Programme for International Student Assessment, aiming to identify non-cognitive variables associated with students’ performance in science. We formulate it as a variable selection problem under a general latent variable model framework and further propose a knockoff method that conducts variable selection with a controlled error rate for false selections.
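To illustrate what "variable selection with a controlled error rate" can look like in practice, here is a minimal sketch of the selection step of the generic knockoff filter (the "knockoff+" threshold from the knockoff literature). It is not the paper's latent variable procedure: it assumes a vector of knockoff statistics `W` has already been computed, where a large positive `W[j]` favours the real variable over its knockoff copy. The function names and the example values are hypothetical.

```python
import numpy as np

def knockoff_plus_threshold(W, q):
    """Return the knockoff+ threshold for statistics W at target level q.

    Assumption: W[j] > 0 suggests variable j beats its knockoff copy,
    W[j] < 0 suggests the knockoff copy wins.
    """
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes of W, ascending.
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        # Negative statistics (plus an offset of 1) estimate the false selections.
        false_estimate = 1 + np.sum(W <= -t)
        n_selected = max(1, np.sum(W >= t))
        if false_estimate / n_selected <= q:
            return t
    return np.inf  # no threshold meets the target, so nothing is selected

def knockoff_select(W, q):
    """Indices of variables whose statistic reaches the knockoff+ threshold."""
    W = np.asarray(W, dtype=float)
    tau = knockoff_plus_threshold(W, q)
    return np.flatnonzero(W >= tau)

# Hypothetical statistics for ten candidate variables.
W_example = [5, 4, 3, 2.5, 2, 1.5, -1, 0.5, 3.5, 4.5]
selected = knockoff_select(W_example, q=0.2)
```

The threshold is the smallest value at which the estimated proportion of false selections drops to `q` or below, which is why the sign pattern of `W`, and not only its magnitude, drives the result.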
Award ID(s):
1915099
PAR ID:
10555417
Publisher / Repository:
Oxford University Press
Journal Name:
Journal of the Royal Statistical Society Series A: Statistics in Society
Volume:
187
Issue:
3
ISSN:
0964-1998
Page Range / eLocation ID:
723 to 747
Sponsoring Org:
National Science Foundation
More Like this
  1. In the last few decades, various spectroscopic soft sensors that predict sample properties from spectroscopic readings have been reported. To improve prediction performance, variable selection that aims to eliminate irrelevant wavelengths is often performed prior to soft sensor model building. However, due to the data-driven nature of many variable selection methods, they can be sensitive to the choice of the training data, and oftentimes the selected wavelengths show little connection to the underlying chemical bonds or functional groups that determine the property of the sample. To address these limitations, we proposed a new variable selection method, namely consistency enhanced evolution for variable selection (CEEVS), which focuses on identifying the variables that are consistently selected from different training datasets. To demonstrate the effectiveness and robustness of CEEVS, we compared it with three representative variable selection methods using two published NIR datasets. We show that by identifying variables with high selection consistency, CEEVS not only achieves improved soft sensor performance, but also identifies key chemical information from spectroscopic data.
  2. Background: Ecological communities tend to be spatially structured due to environmental gradients and/or spatially contagious processes such as growth, dispersion and species interactions. Data transformation followed by the use of algorithms such as Redundancy Analysis (RDA) is a fairly common approach in studies searching for spatial structure in ecological communities, despite recent suggestions advocating the use of Generalized Linear Models (GLMs). Here, we compared the performance of GLMs and RDA in describing spatial structure in ecological community composition data. We simulated realistic presence/absence data typical of many β-diversity studies. For model selection we used the standard methods found in most studies involving RDA and GLMs. Methods: We simulated communities with known spatial structure, based on three real spatial community presence/absence datasets (one terrestrial, one marine and one freshwater). We used spatial eigenvectors as explanatory variables. We varied the number of non-zero coefficients of the spatial variables and the spatial scales with which these coefficients were associated, and then compared how well the GLM and RDA frameworks retrieved the spatial patterns contained in the simulated communities. We used two different methods for model selection: Forward Selection (FW) for RDA and the Akaike Information Criterion (AIC) for GLMs. The performance of each method was assessed by scoring overall accuracy as the proportion of variables whose inclusion/exclusion status was correct, and by distinguishing which kind of error was observed for each method. We also assessed whether errors in variable selection could affect the interpretation of spatial structure. Results: Overall, GLM with AIC-based model selection (GLM/AIC) performed better than RDA/FW in selecting spatial explanatory variables, although under some simulations the methods performed similarly. In general, RDA/FW performed unpredictably, often retaining too many explanatory variables and selecting variables associated with incorrect spatial scales. The spatial scale of the pattern had a negligible effect on GLM/AIC performance but consistently affected RDA’s error rates under almost all scenarios. Conclusion: We encourage the use of GLM/AIC for studies searching for spatial drivers of species presence/absence patterns, since this framework outperformed RDA/FW in situations most likely to be found in natural communities. It is likely that such recommendations might extend to other types of explanatory variables.
  3. Purpose: This study examined differences related to gender and racial/ethnic identity among academic researchers participating in the National Science Foundation’s “Innovation-Corps” (NSF I-Corps) entrepreneurship training program. Drawing from prior research in the fields of technology entrepreneurship and science, technology, engineering and mathematics (STEM) education, this study addresses the goal of broadening participation in academic entrepreneurship. Design/methodology/approach: Using ANOVA and MANOVA analyses, we tested for differences by gender and minoritized racial/ethnic identity for four variables considered pertinent to successful program outcomes: (1) prior entrepreneurial experience, (2) perceptions of instructional climate, (3) quality of project team interactions and (4) future entrepreneurial intention. The sample includes faculty (n = 434) and graduate students (n = 406) who completed pre- and post-course surveys related to a seven-week nationwide training program. Findings: The findings show that group differences based on minoritized racial/ethnic identity compared with majority group identity were largely not evident. Previous research findings were replicated for only one variable, indicating that women report lower amounts of total prior entrepreneurial experience than men, but no gender differences were found for other study variables. Originality/value: Our analyses respond to repeated calls for research in the fields of entrepreneurship and STEM education to simultaneously examine intersecting minoritized and/or under-represented social identities to inform recruitment and retention efforts. The unique and large I-Corps national dataset offered the statistical power to quantitatively test for differences between identity groups. We discuss the implications of the inconsistencies in our analyses with prior findings, such as the need to consider selection bias.
  4. The architectures of many neural networks rely heavily on the underlying grid associated with the variables, for instance, the lattice of pixels in an image. For general biomedical data without a grid structure, the multi‐layer perceptron (MLP) and deep belief network (DBN) are often used. However, in these networks, variables are treated homogeneously in the sense of network structure; and it is difficult to assess their individual importance. In this paper, we propose a novel neural network called Variable‐block tree Net (VtNet) whose architecture is determined by an underlying tree with each node corresponding to a subset of variables. The tree is learned from the data to best capture the causal relationships among the variables. VtNet contains a long short‐term memory (LSTM)‐like cell for every tree node. The input and forget gates of each cell control the information flow through the node, and they are used to define a significance score for the variables. To validate the defined significance score, VtNet is trained using smaller trees with variables of low scores removed. Hypothesis tests are conducted to show that variables of higher scores influence classification more strongly. Comparison is made with the variable importance score defined in Random Forest from the aspect of variable selection. Our experiments demonstrate that VtNet is highly competitive in classification accuracy and can often improve accuracy by removing variables with low significance scores. 
  5. Given data obtained under two sampling conditions, it is often of interest to identify variables that behave differently in one condition than in the other. We introduce a method for differential analysis of second-order behavior called Differential Correlation Mining (DCM). The DCM method identifies differentially correlated sets of variables, with the property that the average pairwise correlation between variables in a set is higher under one sample condition than the other. DCM is based on an iterative search procedure that adaptively updates the size and elements of a candidate variable set. Updates are performed via hypothesis testing of individual variables, based on the asymptotic distribution of their average differential correlation. We investigate the performance of DCM by applying it to simulated data as well as to recent experimental datasets in genomics and brain imaging. 
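The quantity that the Differential Correlation Mining abstract in item 5 describes, the average pairwise correlation of a variable set under one condition compared with the other, can be sketched as follows. This is only an illustration of that score under simple assumptions (Pearson correlation, rows as samples, a candidate set with at least two variables); it does not implement the iterative search or the hypothesis tests of DCM, and the function names and toy data are hypothetical.

```python
import numpy as np

def mean_pairwise_correlation(X, idx):
    """Average off-diagonal Pearson correlation among columns idx of X.

    Assumption: rows of X are samples, columns are variables, len(idx) >= 2.
    """
    idx = list(idx)
    C = np.corrcoef(X[:, idx], rowvar=False)
    k = len(idx)
    # Sum of all entries minus the diagonal leaves the k*(k-1) pairwise terms.
    return (C.sum() - np.trace(C)) / (k * (k - 1))

def differential_correlation(X1, X2, idx):
    """Average pairwise correlation under condition 1 minus condition 2."""
    return mean_pairwise_correlation(X1, idx) - mean_pairwise_correlation(X2, idx)

# Toy data: columns 0 and 1 move together in X1 and in opposition in X2.
X1 = np.array([[1, 2, 5], [2, 4, 1], [3, 6, 2], [4, 8, 9]], dtype=float)
X2 = np.array([[1, 4, 3], [2, 3, 7], [3, 2, 1], [4, 1, 6]], dtype=float)
score = differential_correlation(X1, X2, [0, 1])
```

A set with a large positive score is more tightly correlated in the first condition than in the second, which is the property the abstract attributes to the sets DCM reports.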