NSF PAR Search | NSF Public Access Repository


ARK: Robust knockoffs inference with coupling

https://doi.org/10.1214/24-AOS2480

Fan, Yingying; Gao, Lan; Lv, Jinchi (April 2025, The Annals of Statistics)

We investigate the robustness of the model-X knockoffs framework with respect to the misspecified or estimated feature distribution. We achieve such a goal by theoretically studying the feature selection performance of a practically implemented knockoffs algorithm, which we name as the approximate knockoffs (ARK) procedure, under the measures of the false discovery rate (FDR) and k-familywise error rate (k-FWER). The approximate knockoffs procedure differs from the model-X knockoffs procedure only in that the former uses the misspecified or estimated feature distribution. A key technique in our theoretical analyses is to couple the approximate knockoffs procedure with the model-X knockoffs procedure so that random variables in these two procedures can be close in realizations. We prove that if such a coupled model-X knockoffs procedure exists, the approximate knockoffs procedure can achieve the asymptotic FDR or k-FWER control at the target level. We showcase three specific constructions of such coupled model-X knockoff variables, verifying their existence and justifying the robustness of the model-X knockoffs framework. Additionally, we formally connect our concept of knockoff variable coupling to a type of Wasserstein distance.
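For readers unfamiliar with how a knockoffs procedure turns feature statistics into a selected set under an FDR target, the following is a minimal sketch of the standard knockoff+ selection step from the knockoffs literature. The abstract does not spell out the selection rule, so the rule, the function names, and the example statistics `W` are all assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def knockoff_plus_threshold(W, q):
    """Return the smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q,
    or infinity if no such t exists (assumed knockoff+ rule)."""
    W = np.asarray(W, dtype=float)
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return float(t)
    return float("inf")

def knockoff_select(W, q):
    """Indices of features whose statistic is at least the threshold."""
    W = np.asarray(W, dtype=float)
    return np.flatnonzero(W >= knockoff_plus_threshold(W, q))

# Hypothetical knockoff statistics: large positive values favor the
# original feature over its knockoff copy.
W_example = [5, 4, 3, 2.5, 2, -1, 1.5, 0.5]
selected = knockoff_select(W_example, q=0.2)
```

In this sketch the statistics are taken as given; in the setting the abstract studies, they would be computed from knockoff variables generated under a misspecified or estimated feature distribution.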
Free, publicly-accessible full text available April 1, 2026
Optimal Nonparametric Inference with Two-Scale Distributional Nearest Neighbors

https://doi.org/10.1080/01621459.2022.2115375

Demirkaya, Emre; Fan, Yingying; Gao, Lan; Lv, Jinchi; Vossler, Patrick; Wang, Jingbo (October 2022, Journal of the American Statistical Association)

The weighted nearest neighbors (WNN) estimator has been popularly used as a flexible and easy-to-implement nonparametric tool for mean regression estimation. The bagging technique is an elegant way to form WNN estimators with weights automatically generated to the nearest neighbors (Steele, 2009; Biau et al., 2010); we name the resulting estimator as the distributional nearest neighbors (DNN) for easy reference. Yet, there is a lack of distributional results for such an estimator, limiting its application to statistical inference. Moreover, when the mean regression function has higher-order smoothness, DNN does not achieve the optimal nonparametric convergence rate, mainly because of the bias issue. In this work, we provide an in-depth technical analysis of the DNN, based on which we suggest a bias reduction approach for the DNN estimator by linearly combining two DNN estimators with different subsampling scales, resulting in the novel two-scale DNN (TDNN) estimator. The two-scale DNN estimator has an equivalent representation of WNN with weights admitting explicit forms and some being negative. We prove that, thanks to the use of negative weights, the two-scale DNN estimator enjoys the optimal nonparametric rate of convergence in estimating the regression function under the fourth order smoothness condition. We further go beyond estimation and establish that the DNN and two-scale DNN are both asymptotically normal as the subsampling scales and sample size diverge to infinity. For the practical implementation, we also provide variance estimators and a distribution estimator using the jackknife and bootstrap techniques for the two-scale DNN. These estimators can be exploited for constructing valid confidence intervals for nonparametric inference of the regression function. The theoretical results and appealing finite-sample performance of the suggested two-scale DNN method are illustrated with several simulation examples and a real data application.
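The idea of linearly combining two DNN estimators with different subsampling scales can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it assumes the bagged 1-nearest-neighbor weight on the i-th nearest neighbor is C(n-i, s-1)/C(n, s) for subsampling scale s, and it assumes the leading bias term scales like s^(-2/d) in dimension d, so the two combination weights are chosen to sum to one and cancel that term. Neither formula appears in the abstract, and the function names are hypothetical.

```python
import numpy as np
from math import comb

def dnn_weights(n, s):
    """Assumed DNN weights: probability that the i-th nearest neighbor
    (i = 1..n) is the nearest point within a random subsample of size s."""
    total = comb(n, s)
    return np.array([comb(n - i, s - 1) / total for i in range(1, n + 1)])

def dnn_estimate(X, y, x0, s):
    """Weighted average of responses, ordered by distance to x0."""
    order = np.argsort(np.linalg.norm(X - x0, axis=1))
    return float(dnn_weights(len(y), s) @ y[order])

def tdnn_estimate(X, y, x0, s1, s2):
    """Two-scale combination w1 * DNN(s1) + w2 * DNN(s2), requires s1 != s2."""
    d = X.shape[1]
    r1, r2 = s1 ** (-2.0 / d), s2 ** (-2.0 / d)
    # Solve w1 + w2 = 1 and w1 * r1 + w2 * r2 = 0 (assumed bias cancellation).
    w1 = r2 / (r2 - r1)
    w2 = 1.0 - w1
    return w1 * dnn_estimate(X, y, x0, s1) + w2 * dnn_estimate(X, y, x0, s2)

# Hypothetical usage on simulated data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
estimate = tdnn_estimate(X, y, np.zeros(2), s1=10, s2=20)
```

With s1 < s2 the weight w1 comes out negative, which is consistent with the abstract's remark that some weights in the equivalent WNN representation are negative.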
Full Text Available
SIMPLE: Statistical inference on membership profiles in large networks

https://doi.org/10.1111/rssb.12505

Fan, Jianqing; Fan, Yingying; Han, Xiao; Lv, Jinchi (April 2022, Journal of the Royal Statistical Society: Series B (Statistical Methodology))

Full Text Available
Asymptotic distributions of high-dimensional distance correlation inference

https://doi.org/10.1214/20-aos2024

Gao, Lan; Fan, Yingying; Lv, Jinchi; Shao, Qi-Man (August 2021, The Annals of Statistics)

Full Text Available
Not Registered? Please Sign Up First: A Randomized Field Experiment on the Ex Ante Registration Request

https://doi.org/10.1287/isre.2021.0999

Huang, Ni; Mojumder, Probal; Sun, Tianshu; Lv, Jinchi; Golden, Joseph M. (September 2021, Information Systems Research)

Online commerce websites often request users to register in the online shopping process. Recognizing the challenges of user registration, many websites opt to delay their registration request until the end of the conversion funnel (i.e., ex post registration request). Our study explores an alternative approach by asking users to register with the website at the beginning of their shopping journey (i.e., ex ante registration request). Guided by a stylized analytical model, we conducted a large-scale randomized field experiment in partnership with an online retailer in the United States to examine how the ex ante request affects users’ registration decisions, short-term customer conversions, and long-term purchase behaviors. Specifically, we randomly assigned the new users in the website’s incoming traffic to one of two experimental groups: one with an ex ante registration request preceding the ex post request (treatment) and the other with only an ex post registration request (control). Our results show that the ex ante request leads to an increased probability of user registration; that is, the users in the treatment group, on average, are 58.08% relatively more likely to register with the website than those in the control group. Furthermore, the ex ante request leads to significant increases in customer purchases in the long run. Based on our estimation of the local average treatment effects, the ex ante registered users are 10.89% relatively more likely to make a purchase, place a 16.76% relatively greater number of orders, and generate 13.22% relatively higher total revenue for the firm in the long run. Finally, the ex ante request also does not impact customer conversion in the short term. Further investigation into the long-term and short-term effects provides suggestive evidence on several potential mechanisms, such as firm-initiated interaction and screening of low-interest users. Our study provides managerial implications to the e-commerce websites on customer acquisition and contributes to the research on IT artifact design.
Full Text Available
Nonsparse Learning with Latent Variables

https://doi.org/10.1287/opre.2020.2005

Zheng, Zemin; Lv, Jinchi; Lin, Wei (January 2021, Operations Research)
As a popular tool for producing meaningful and interpretable models, large-scale sparse learning works efficiently in many optimization applications when the underlying structures are indeed or close to sparse. However, naively applying the existing regularization methods can result in misleading outcomes because of model misspecification. In this paper, we consider nonsparse learning under the factors plus sparsity structure, which yields a joint modeling of sparse individual effects and common latent factors. A new methodology of nonsparse learning with latent variables (NSL) is proposed for joint estimation of the effects of two groups of features, one for individual effects and the other associated with the latent substructures, when the nonsparse effects are captured by the leading population principal component score vectors. We derive the convergence rates of both sample principal components and their score vectors that hold for a wide class of distributions. With the properly estimated latent variables, properties including model selection consistency and oracle inequalities under various prediction and estimation losses are established. Our new methodology and results are evidenced by simulation and real-data examples.
Full Text Available
Large-scale model selection in misspecified generalized linear models

https://doi.org/10.1093/biomet/asab005

Demirkaya, Emre; Feng, Yang; Basu, Pallavi; Lv, Jinchi (January 2021, Biometrika)

Summary: Model selection is crucial both to high-dimensional learning and to inference for contemporary big data applications in pinpointing the best set of covariates among a sequence of candidate interpretable models. Most existing work implicitly assumes that the models are correctly specified or have fixed dimensionality, yet both model misspecification and high dimensionality are prevalent in practice. In this paper, we exploit the framework of model selection principles under the misspecified generalized linear models presented in Lv & Liu (2014), and investigate the asymptotic expansion of the posterior model probability in the setting of high-dimensional misspecified models. With a natural choice of prior probabilities that encourages interpretability and incorporates the Kullback–Leibler divergence, we suggest using the high-dimensional generalized Bayesian information criterion with prior probability for large-scale model selection with misspecification. Our new information criterion characterizes the impacts of both model misspecification and high dimensionality on model selection. We further establish the consistency of covariance contrast matrix estimation and the model selection consistency of the new information criterion in ultrahigh dimensions under some mild regularity conditions. Our numerical studies demonstrate that the proposed method enjoys improved model selection consistency over its main competitors.
Full Text Available
Asymptotic Theory of Eigenvectors for Random Matrices With Diverging Spikes

https://doi.org/10.1080/01621459.2020.1840990

Fan, Jianqing; Fan, Yingying; Han, Xiao; Lv, Jinchi (January 2020, Journal of the American Statistical Association)

Full Text Available
Tuning-Free Heterogeneous Inference in Massive Networks

https://doi.org/10.1080/01621459.2018.1537920

Ren, Zhao; Kang, Yongjian; Fan, Yingying; Lv, Jinchi (October 2019, Journal of the American Statistical Association)

Full Text Available
DeepLINK: Deep learning inference using knockoffs with applications to genomics

https://doi.org/10.1073/pnas.2104683118

Zhu, Zifan; Fan, Yingying; Kong, Yinfei; Lv, Jinchi; Sun, Fengzhu (September 2021, Proceedings of the National Academy of Sciences)

Significance: Although practically attractive with high prediction and classification power, complicated learning methods often lack interpretability and reproducibility, limiting their scientific usage. A useful remedy is to select truly important variables contributing to the response of interest. We develop a method for deep learning inference using knockoffs, DeepLINK, to achieve the goal of variable selection with controlled error rate in deep learning models. We show that DeepLINK can also have high power in variable selection with a broad class of model designs. We then apply DeepLINK to three real datasets and produce statistical inference results with both reproducibility and biological meanings, demonstrating its promising usage to a broad range of scientific applications.
