

Search for: All records

Creators/Authors contains: "Ren, Zhao"


  1. In modern statistical applications, identifying critical features in high-dimensional data is essential for scientific discoveries. Traditional best subset selection methods face computational challenges, while regularization approaches such as Lasso, SCAD and their variants often perform poorly with ultrahigh-dimensional data. Sure screening methods, widely used for dimensionality reduction, have been developed as popular alternatives, but few address the heavy-tailed characteristics of modern big data. This paper introduces a new sure screening method, based on robust distance correlation (RDC), designed for heavy-tailed data. The proposed method inherits the benefits of the original model-free distance correlation-based screening while robustly estimating distance correlation in the presence of heavy-tailed data. We further develop an FDR control procedure by incorporating the Reflection via Data Splitting (REDS) method. Extensive simulations demonstrate the method's advantage over existing screening procedures under different scenarios of heavy-tailedness. Its application to high-dimensional heavy-tailed RNA-seq data from The Cancer Genome Atlas (TCGA) pancreatic cancer cohort showcases superior performance in identifying biologically meaningful genes predictive of MAPK1 protein expression, which is critical to pancreatic cancer.
    Free, publicly-accessible full text available September 1, 2026
  2. In this paper, we propose differentially private algorithms for robust (multivariate) mean estimation and inference under heavy-tailed distributions, with a focus on Gaussian differential privacy. First, we provide a comprehensive analysis of the Huber mean estimator with increasing dimensions, including a non-asymptotic deviation bound, a Bahadur representation, and (uniform) Gaussian approximations. Second, we privatize the Huber mean estimator via noisy gradient descent, which is proven to achieve near-optimal statistical guarantees. The key is to quantify the trade-off between statistical accuracy, degree of robustness and privacy level, which is governed by a carefully chosen robustification parameter. Finally, we construct private confidence intervals for the proposed estimator by incorporating a private and robust covariance estimator. Our findings are demonstrated by simulation studies.
    Free, publicly-accessible full text available November 1, 2025
  3. With advances in high-throughput technology, molecular disease subtyping using high-dimensional omics data has been recognized as an effective approach for identifying subtypes of complex diseases with distinct disease mechanisms and prognoses. Conventional cluster analysis takes omics data as input and generates patient clusters with similar gene expression patterns. The omics data, however, usually contain multifaceted cluster structures that can be defined by different sets of genes. If a gene set associated with irrelevant clinical variables (e.g., sex or age) dominates the clustering process, the resulting clusters may not capture clinically meaningful disease subtypes. This motivates the clustering framework developed in this paper, which is guided by a prespecified disease outcome such as a lung function measurement or survival. We propose two outcome-guided disease subtyping methods for omics data, one using a generative model and the other a weighted joint likelihood. Both methods connect an outcome association model and a disease subtyping model through a latent variable of cluster labels. Compared to the generative model, the weighted joint likelihood contains a data-driven weight parameter that balances the likelihood contributions from outcome association and gene cluster separation, which improves generalizability in independent validation but requires heavier computation. Extensive simulations and two real applications in lung disease and triple-negative breast cancer demonstrate the superior performance of the outcome-guided clustering methods in terms of disease subtyping accuracy, gene selection and outcome association. Unlike existing clustering methods, the outcome-guided disease subtyping framework creates a new precision medicine paradigm to directly identify patient subgroups with clinical association.
  4. We address the problem of adaptive minimax density estimation on $\mathbb{R}^{d}$ with $L_{p}$ loss functions under Huber’s contamination model. To investigate the contamination effect on the optimal estimation of the density, we first establish the minimax rate with the assumption that the density is in an anisotropic Nikol’skii class. We then develop a data-driven bandwidth selection procedure for kernel estimators, which can be viewed as a robust generalization of the Goldenshluger-Lepski method. We show that the proposed bandwidth selection rule can lead to the estimator being minimax adaptive to either the smoothness parameter or the contamination proportion. When both of them are unknown, we prove that finding any minimax-rate adaptive method is impossible. Extensions to smooth contamination cases are also discussed.
  5. Motivation: Advances in high-throughput technology have made it possible to characterize a wide variety of epigenetic modifications and noncoding RNAs across the genome that are involved in disease pathogenesis by regulating gene expression. The high dimensionality of both epigenetic/noncoding RNA and gene expression data makes it challenging to identify the important regulators of genes. Conducting a univariate test for each possible regulator–gene pair incurs a serious multiple-comparison burden, and directly applying regularization methods to select regulator–gene pairs is computationally infeasible. Applying fast screening to reduce the dimension before regularization is more efficient and stable than applying regularization methods alone. Results: We propose a novel screening method based on robust partial correlation to detect epigenetic and noncoding RNA regulators of gene expression over the whole genome, a problem that includes both high-dimensional predictors and high-dimensional responses. Compared to existing screening methods, our method is conceptually innovative in that it reduces the dimension of both predictor and response, and screens at both the node (regulators or genes) and edge (regulator–gene pairs) levels. We develop data-driven procedures to determine the conditional sets and the optimal screening threshold, and implement a fast iterative algorithm. Simulations and applications to long noncoding RNA and microRNA regulation in kidney cancer and DNA methylation regulation in glioblastoma multiforme illustrate the validity and advantage of our method. Availability and implementation: The R package, related source code and real datasets used in this article are provided at https://github.com/kehongjie/rPCor. Supplementary information: Supplementary data are available at Bioinformatics online.
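
As an illustration of the idea summarized in item 2, the following is a minimal one-dimensional sketch of a Huber mean estimate computed by gradient descent with Gaussian noise added at each step. It is not the authors' algorithm: the function names, the zero initialization, the step size, and the free `noise_sd` parameter are all assumptions made for this example, and no privacy calibration is attempted. The only fact relied on is that each clipped score has magnitude at most `tau`, so replacing one observation changes the averaged score by at most `2 * tau / n`; a private version would scale the noise to that sensitivity.

```python
import random


def huber_score(residual, tau):
    """Derivative of the Huber loss: the residual clipped to [-tau, tau]."""
    return max(-tau, min(tau, residual))


def noisy_gd_huber_mean(data, tau, noise_sd, steps=50, step_size=1.0,
                        init=0.0, seed=0):
    """Gradient descent on the Huber loss with Gaussian noise at each step.

    tau      -- robustification parameter (clipping level), assumed given
    noise_sd -- standard deviation of the added noise; a hypothetical free
                parameter here, not calibrated to any privacy budget
    """
    rng = random.Random(seed)
    n = len(data)
    mu = init
    for _ in range(steps):
        # Averaged clipped residuals (the negative gradient of the Huber loss).
        avg_score = sum(huber_score(x - mu, tau) for x in data) / n
        # Perturb the averaged score before taking the step.
        mu += step_size * (avg_score + rng.gauss(0.0, noise_sd))
    return mu


if __name__ == "__main__":
    # 99 observations at 0 and one gross outlier at 1000.
    sample = [0.0] * 99 + [1000.0]
    print(sum(sample) / len(sample))                          # sample mean: 10.0
    print(noisy_gd_huber_mean(sample, tau=1.0, noise_sd=0.0))  # about 1/99
    print(noisy_gd_huber_mean(sample, tau=1.0, noise_sd=0.01))
```

With `noise_sd=0.0` the iteration reduces to plain gradient descent, which makes the robustness effect easy to see: the outlier contributes at most `tau` to each update, so the estimate stays near 0 while the sample mean is pulled to 10. A larger `tau` reduces bias for clean data but increases both outlier influence and the noise needed for a given sensitivity, which is the trade-off the abstract attributes to the robustification parameter.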