Post-Selection Inference
We discuss inference after data exploration, with a particular focus on inference after model or variable selection. We review three popular approaches to this problem: sample splitting, simultaneous inference, and conditional selective inference. We explain how each approach works and highlight its advantages and disadvantages. We also provide an illustration of these post-selection inference approaches.
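Of the three approaches, sample splitting is the simplest to carry out. The sketch below is a minimal illustration of that idea (not the article's own code): the simulated design, the use of scikit-learn's LassoCV for selection, and statsmodels OLS for the confidence intervals are our assumptions. Variables are selected on one half of the data, and ordinary intervals are computed for the selected coefficients on the held-out half; those intervals remain valid because the inference data played no role in the selection. The price is that neither stage uses all of the observations.

```python
# Minimal sketch of sample splitting for post-selection inference.
# Half the data is used to select variables with the lasso; the held-out
# half is used to compute standard OLS confidence intervals, which remain
# valid because they do not reuse the data that drove the selection.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [1.0, -0.5, 0.5]                      # three truly active variables
y = X @ beta + rng.normal(size=n)

# Split: first half for selection, second half for inference.
X_sel, y_sel = X[: n // 2], y[: n // 2]
X_inf, y_inf = X[n // 2 :], y[n // 2 :]

selected = np.flatnonzero(LassoCV(cv=5).fit(X_sel, y_sel).coef_ != 0)

# Ordinary least squares on the held-out half, restricted to the selected set.
ols = sm.OLS(y_inf, sm.add_constant(X_inf[:, selected])).fit()
print("selected variables:", selected)
print(ols.conf_int())                            # nominal 95% CIs for the selected coefficients
```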
- PAR ID: 10400405
- Date Published:
- Journal Name: Annual Review of Statistics and Its Application
- Volume: 9
- Issue: 1
- ISSN: 2326-8298
- Page Range / eLocation ID: 505 to 527
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
After selection with the Group LASSO (or generalized variants such as the overlapping, sparse, or standardized Group LASSO), inference for the selected parameters is unreliable in the absence of adjustments for selection bias. In the penalized Gaussian regression setup, existing approaches provide adjustments for selection events that can be expressed as linear inequalities in the data variables. Such a representation, however, fails to hold for selection with the Group LASSO and substantially obstructs the scope of subsequent post-selective inference. Key questions of inferential interest—for example, inference for the effects of selected variables on the outcome—remain unanswered. In the present paper, we develop a consistent, post-selective, Bayesian method to address the existing gaps by deriving a likelihood adjustment factor and an approximation thereof that eliminates bias from the selection of groups. Experiments on simulated data and data from the Human Connectome Project demonstrate that our method recovers the effects of parameters within the selected groups while paying only a small price for bias adjustment.
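The grouped selection step that this line of work starts from can be sketched with a basic proximal-gradient group lasso. The code below only illustrates selecting whole groups of coefficients on simulated data; the design, the penalty level, and the from-scratch proximal update are our assumptions, and the paper's Bayesian bias adjustment is not implemented here.

```python
# Minimal proximal-gradient sketch of group-lasso selection on simulated data.
# This shows only the selection step; adjusting inference for the resulting
# selection bias (the subject of the paper) is not attempted here.
import numpy as np

def group_lasso(X, y, groups, lam, n_iter=500):
    """Proximal gradient for 0.5/n * ||y - X b||^2 + lam * sum_g sqrt(|g|) * ||b_g||_2."""
    n, p = X.shape
    beta = np.zeros(p)
    step = n / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant of the smooth part
    for _ in range(n_iter):
        z = beta + step * X.T @ (y - X @ beta) / n
        for g in groups:                          # block soft-thresholding, one group at a time
            thr = step * lam * np.sqrt(len(g))
            norm = np.linalg.norm(z[g])
            beta[g] = 0.0 if norm <= thr else (1 - thr / norm) * z[g]
    return beta

rng = np.random.default_rng(1)
n, p = 150, 12
groups = [list(range(i, i + 3)) for i in range(0, p, 3)]    # four groups of three
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[0:3] = 1.0                                        # only the first group is active
y = X @ beta_true + rng.normal(size=n)

beta_hat = group_lasso(X, y, groups, lam=0.15)
selected_groups = [i for i, g in enumerate(groups) if np.any(beta_hat[g] != 0)]
print("selected groups:", selected_groups)
```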
In this paper, we study the effectiveness of using a constant stepsize in statistical inference via linear stochastic approximation (LSA) algorithms with Markovian data. After establishing a Central Limit Theorem (CLT), we outline an inference procedure that uses averaged LSA iterates to construct confidence intervals (CIs). Our procedure leverages the fast mixing property of constant-stepsize LSA for better covariance estimation and employs Richardson-Romberg (RR) extrapolation to reduce the bias induced by constant stepsize and Markovian data. We develop theoretical results for guiding stepsize selection in RR extrapolation, and identify several important settings where the bias provably vanishes even without extrapolation. We conduct extensive numerical experiments and compare against classical inference approaches. Our results show that using a constant stepsize enjoys easy hyperparameter tuning, fast convergence, and consistently better CI coverage, especially when data is limited.
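A scalar toy version of the idea (entirely our own construction: the two-state Markov chain, the choice of a(s) and b(s), the step sizes, and the plain tail average are assumptions, not the paper's setup) shows the constant-stepsize bias under Markovian data and how RR extrapolation across step sizes alpha and 2*alpha reduces it.

```python
# Toy scalar LSA with Markovian data: theta_{k+1} = theta_k + alpha * (b(s_k) - a(s_k) * theta_k).
# Averaged iterates under a constant stepsize carry an O(alpha) bias; Richardson-Romberg
# extrapolation across two step sizes (alpha and 2*alpha) cancels the leading bias term.
import numpy as np

def averaged_lsa(alpha, n_steps, rng, burn_in=10_000):
    a = np.array([0.5, 1.5])          # a(s) for states s = 0, 1
    b = np.array([1.0, 3.0])          # b(s); target theta* = E[b]/E[a] = 2.0 under the uniform stationary law
    s, theta, total = 0, 0.0, 0.0
    for k in range(n_steps + burn_in):
        theta += alpha * (b[s] - a[s] * theta)
        if k >= burn_in:
            total += theta
        # slowly mixing two-state chain: stay in the current state with probability 0.9
        s = s if rng.random() < 0.9 else 1 - s
    return total / n_steps

rng = np.random.default_rng(2)
alpha, n_steps = 0.2, 1_000_000
theta_a = averaged_lsa(alpha, n_steps, rng)
theta_2a = averaged_lsa(2 * alpha, n_steps, rng)
theta_rr = 2 * theta_a - theta_2a     # Richardson-Romberg extrapolation
print(f"target 2.0 | alpha: {theta_a:.4f} | 2*alpha: {theta_2a:.4f} | RR: {theta_rr:.4f}")
```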
Selecting among competing statistical models is a core challenge in science. However, the many possible approaches and techniques for model selection, and the conflicting recommendations for their use, can be confusing. We contend that much confusion surrounding statistical model selection results from failing to first clearly specify the purpose of the analysis. We argue that there are three distinct goals for statistical modeling in ecology: data exploration, inference, and prediction. Once the modeling goal is clearly articulated, an appropriate model selection procedure is easier to identify. We review model selection approaches and highlight their strengths and weaknesses relative to each of the three modeling goals. We then present examples of modeling for exploration, inference, and prediction using a time series of butterfly population counts. These show how a model selection approach flows naturally from the modeling goal, leading to different models selected for different purposes, even with exactly the same data set. This review illustrates best practices for ecologists and should serve as a reminder that statistical recipes cannot substitute for critical thinking or for the use of independent data to test hypotheses and validate predictions.
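As a small illustration of how the stated goal drives the choice of procedure, the sketch below (our own toy example, not the paper's butterfly analysis: the simulated Poisson counts, the polynomial trend candidates, and the use of statsmodels are assumptions) scores the same candidate models two ways: by AIC, as one might for inference about trend structure, and by holdout forecast error, as one might for prediction.

```python
# Same simulated count series, two modeling goals: AIC-based comparison of trend
# structures versus holdout forecast error. The two criteria need not agree.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
years = np.arange(30)
mu = np.exp(3.0 + 0.05 * years - 0.002 * years**2)      # smooth underlying trend
counts = rng.poisson(mu)

def design(t, degree):
    t = t / 10.0                                         # rescale time to keep the design well conditioned
    return sm.add_constant(np.column_stack([t**d for d in range(1, degree + 1)]))

train, test = years[:24], years[24:]
for degree in (1, 2, 3):
    fit_all = sm.GLM(counts, design(years, degree), family=sm.families.Poisson()).fit()
    fit_train = sm.GLM(counts[:24], design(train, degree), family=sm.families.Poisson()).fit()
    forecast_mse = np.mean((counts[24:] - fit_train.predict(design(test, degree))) ** 2)
    print(f"degree {degree}: AIC = {fit_all.aic:.1f}, holdout forecast MSE = {forecast_mse:.1f}")
```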
Prior approaches to AS-aware path selection in Tor do not consider node bandwidth or the other characteristics that Tor uses to ensure load balancing and quality of service. Further, since the AS path from the client's exit to her destination can only be inferred once the destination is known, the prior approaches may have problems constructing circuits in advance, which is important for Tor performance. In this paper, we propose and evaluate DeNASA, a new approach to AS-aware path selection that is destination-naive, in that it does not need to know the client's destination to pick paths, and that takes advantage of Tor's circuit selection algorithm. To this end, we first identify the ASes most likely to be traversed by Tor streams. We call this set of ASes the Suspect AS list and find that it consists of the eight highest-ranking Tier 1 ASes. We then test the accuracy of Qiu and Gao AS-level path inference for identifying the presence of these ASes in a path and show that inference accuracy is 90%. We develop an AS-aware algorithm called DeNASA that uses Qiu and Gao inference to avoid Suspect ASes. DeNASA reduces Tor stream vulnerability by 74%. We also show that DeNASA has performance similar to Tor: due to the destination-naive property, time to first byte (TTFB) is close to Tor's, and due to leveraging Tor's bandwidth-weighted relay selection, time to last byte (TTLB) is also similar to Tor's.
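The selection idea can be sketched schematically as bandwidth-weighted sampling restricted to relays whose inferred client-to-relay AS path avoids the Suspect AS list. Everything in the snippet below is a hypothetical stand-in (the relay list, the suspect-AS numbers, and the infer_as_path stub), not DeNASA's actual data, the Qiu and Gao inference code, or Tor's real relay selection.

```python
# Schematic sketch of destination-naive, AS-aware guard selection: bandwidth-weighted
# sampling that rejects guards whose inferred client-to-guard AS path crosses a
# "suspect" AS. Relay list, suspect ASes, and infer_as_path() are hypothetical stand-ins.
import random

SUSPECT_ASES = {174, 1299, 3356}                  # hypothetical Tier 1 ASes to avoid

RELAYS = [
    {"name": "guardA", "asn": 6939, "bandwidth": 9000},
    {"name": "guardB", "asn": 3356, "bandwidth": 20000},
    {"name": "guardC", "asn": 24940, "bandwidth": 5000},
]

def infer_as_path(src_asn, dst_asn):
    """Stand-in for AS-level path inference; returns the ASes a connection would cross."""
    return {src_asn, dst_asn, 1299} if dst_asn == 3356 else {src_asn, dst_asn}

def pick_guard(client_asn):
    # Keep only guards whose inferred client->guard path avoids every suspect AS,
    # then sample among the survivors proportionally to bandwidth (as Tor does).
    safe = [r for r in RELAYS
            if not (infer_as_path(client_asn, r["asn"]) & SUSPECT_ASES)]
    weights = [r["bandwidth"] for r in safe]
    return random.choices(safe, weights=weights, k=1)[0]

print(pick_guard(client_asn=7018)["name"])
```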