NSF PAR Search | NSF Public Access Repository


Robust Distance Correlation for Variable Screening

https://doi.org/10.1002/sta4.70094

Ma, Tianzhou; Yang, Fan; Ke, Hongjie; Ren, Zhao (September 2025, Stat)

In modern statistical applications, identifying critical features in high‐dimensional data is essential for scientific discoveries. Traditional best subset selection methods face computational challenges, while regularization approaches such as Lasso, SCAD and their variants often exhibit poor performance with ultrahigh‐dimensional data. Sure screening methods, widely used for dimensionality reduction, have been developed as popular alternatives, but few target heavy‐tailed characteristics in modern big data. This paper introduces a new sure screening method, based on robust distance correlation (‘RDC’), designed for heavy‐tailed data. The proposed method inherits the benefits of the original model‐free distance correlation‐based screening while robustly estimating distance correlation in the presence of heavy‐tailed data. We further develop an FDR control procedure by incorporating the Reflection via Data Splitting (REDS) method. Extensive simulations demonstrate the method's advantage over existing screening procedures under different scenarios of heavy‐tailedness. Its application to high‐dimensional heavy‐tailed RNA‐seq data from The Cancer Genome Atlas (TCGA) pancreatic cancer cohort showcases superior performance in identifying biologically meaningful genes predictive of MAPK1 protein expression critical to pancreatic cancer.
Free, publicly-accessible full text available September 1, 2026
Gaussian differentially private robust mean estimation and inference

https://doi.org/10.3150/23-BEJ1706

Yu, Myeonghun; Ren, Zhao; Zhou, Wen-Xin (November 2024, Bernoulli)

In this paper, we propose differentially private algorithms for robust (multivariate) mean estimation and inference under heavy-tailed distributions, with a focus on Gaussian differential privacy. First, we provide a comprehensive analysis of the Huber mean estimator with increasing dimensions, including non-asymptotic deviation bound, Bahadur representation, and (uniform) Gaussian approximations. Secondly, we privatize the Huber mean estimator via noisy gradient descent, which is proven to achieve near-optimal statistical guarantees. The key is to characterize quantitatively the trade-off between statistical accuracy, degree of robustness and privacy level, governed by a carefully chosen robustification parameter. Finally, we construct private confidence intervals for the proposed estimator by incorporating a private and robust covariance estimator. Our findings are demonstrated by simulation studies.
Full Text Available
Outcome-guided disease subtyping by generative model and weighted joint likelihood in transcriptomic applications

https://doi.org/10.1214/23-AOAS1865

Li, Yujia; Liu, Peng; Wang, Wenjia; Zong, Wei; Fang, Yusi; Ren, Zhao; Tang, Lu; Celedón, Juan C; Oesterreich, Steffi; Tseng, George C (September 2024, The Annals of Applied Statistics)

With advances in high-throughput technology, molecular disease subtyping by high-dimensional omics data has been recognized as an effective approach for identifying subtypes of complex diseases with distinct disease mechanisms and prognoses. Conventional cluster analysis takes omics data as input and generates patient clusters with similar gene expression pattern. The omics data, however, usually contain multifaceted cluster structures that can be defined by different sets of genes. If the gene set associated with irrelevant clinical variables (e.g., sex or age) dominates the clustering process, the resulting clusters may not capture clinically meaningful disease subtypes. This motivates the development of a clustering framework with guidance from a prespecified disease outcome, such as lung function measurement or survival, in this paper. We propose two disease subtyping methods by omics data with outcome guidance using a generative model or a weighted joint likelihood. Both methods connect an outcome association model and a disease subtyping model by a latent variable of cluster labels. Compared to the generative model, weighted joint likelihood contains a data-driven weight parameter to balance the likelihood contributions from outcome association and gene cluster separation, which improves generalizability in independent validation but requires heavier computing. Extensive simulations and two real applications in lung disease and triple-negative breast cancer demonstrate superior disease subtyping performance of the outcome-guided clustering methods in terms of disease subtyping accuracy, gene selection and outcome association. Unlike existing clustering methods, the outcome-guided disease subtyping framework creates a new precision medicine paradigm to directly identify patient subgroups with clinical association.
Full Text Available
Adaptive minimax density estimation on $\mathbb{R}^{d}$ for Huber’s contamination model

https://doi.org/10.1093/imaiai/iaad045

Zhang, Peiliang; Ren, Zhao (November 2023, Information and Inference: A Journal of the IMA)

We address the problem of adaptive minimax density estimation on $\mathbb{R}^{d}$ with $L_{p}$ loss functions under Huber’s contamination model. To investigate the contamination effect on the optimal estimation of the density, we first establish the minimax rate with the assumption that the density is in an anisotropic Nikol’skii class. We then develop a data-driven bandwidth selection procedure for kernel estimators, which can be viewed as a robust generalization of the Goldenshluger-Lepski method. We show that the proposed bandwidth selection rule can lead to the estimator being minimax adaptive to either the smoothness parameter or the contamination proportion. When both of them are unknown, we prove that finding any minimax-rate adaptive method is impossible. Extensions to smooth contamination cases are also discussed.
Covariance-engaged Classification of Sets via Linear Programming

https://doi.org/10.5705/ss.202020.0253

Ren, Zhao; Jung, Sungkyu; Qiao, Xingye (January 2022, Statistica Sinica)
Full Text Available
High-dimension to high-dimension screening for detecting genome-wide epigenetic and noncoding RNA regulators of gene expression

https://doi.org/10.1093/bioinformatics/btac518

Ke, Hongjie; Ren, Zhao; Qi, Jianfei; Chen, Shuo; Tseng, George C; Ye, Zhenyao; Ma, Tianzhou; Alkan, Can (Ed.) (July 2022, Bioinformatics)

Motivation: The advancement of high-throughput technology characterizes a wide variety of epigenetic modifications and noncoding RNAs across the genome involved in disease pathogenesis via regulating gene expression. The high dimensionality of both epigenetic/noncoding RNA and gene expression data make it challenging to identify the important regulators of genes. Conducting univariate test for each possible regulator–gene pair is subject to serious multiple comparison burden, and direct application of regularization methods to select regulator–gene pairs is computationally infeasible. Applying fast screening to reduce dimension first before regularization is more efficient and stable than applying regularization methods alone. Results: We propose a novel screening method based on robust partial correlation to detect epigenetic and noncoding RNA regulators of gene expression over the whole genome, a problem that includes both high-dimensional predictors and high-dimensional responses. Compared to existing screening methods, our method is conceptually innovative that it reduces the dimension of both predictor and response, and screens at both node (regulators or genes) and edge (regulator–gene pairs) levels. We develop data-driven procedures to determine the conditional sets and the optimal screening threshold, and implement a fast iterative algorithm. Simulations and applications to long noncoding RNA and microRNA regulation in Kidney cancer and DNA methylation regulation in Glioblastoma Multiforme illustrate the validity and advantage of our method. Availability and implementation: The R package, related source codes and real datasets used in this article are provided at https://github.com/kehongjie/rPCor. Supplementary information: Supplementary data are available at Bioinformatics online.
Large-scale multiple inference of collective dependence with applications to protein function

https://doi.org/10.1214/20-AOAS1431

Jernigan, Robert; Jia, Kejue; Ren, Zhao; Zhou, Wen (June 2021, The Annals of Applied Statistics)
Full Text Available
Inference of large modified Poisson-type graphical models: Application to RNA-seq data in childhood atopic asthma studies

https://doi.org/10.1214/20-AOAS1413

Zhang, Rong; Ren, Zhao; Celedón, Juan C.; Chen, Wei (June 2021, The Annals of Applied Statistics)
Full Text Available
Minimax estimation of large precision matrices with bandable Cholesky factor

https://doi.org/10.1214/19-AOS1893

Liu, Yu; Ren, Zhao (August 2020, The Annals of Statistics)
Full Text Available
Variable screening with multiple studies

https://doi.org/10.5705/ss.202017.0439

Ma, Tianzhou; Ren, Zhao; Tseng, George C. (January 2020, Statistica Sinica)

Full Text Available
