

Title: A weighted FDR procedure under discrete and heterogeneous null distributions
Abstract

Multiple testing (MT) with false discovery rate (FDR) control has been widely conducted in the "discrete paradigm" where p-values have discrete and heterogeneous null distributions. However, in this scenario existing FDR procedures often lose some power and may yield unreliable inference, and there does not seem to be an FDR procedure for this setting that partitions hypotheses into groups, employs data-adaptive weights and is nonasymptotically conservative. We propose a weighted p-value-based FDR procedure, "weighted FDR (wFDR) procedure" for short, for MT in the discrete paradigm that efficiently adapts to both heterogeneity and discreteness of p-value distributions. We theoretically justify the nonasymptotic conservativeness of the wFDR procedure under independence, and show via simulation studies that, for MT based on p-values of the binomial test or Fisher's exact test, it is more powerful than six other procedures. The wFDR procedure is applied to two examples based on discrete data, a drug safety study and a differential methylation study, where it makes more discoveries than two existing methods.
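To make the weighting idea concrete, the sketch below implements a generic weighted Benjamini-Hochberg step-up rule on weighted p-values p_i / w_i, with the weights renormalized to sum to the number of hypotheses. It is only an illustration of the weighted-p-value framework, not the paper's wFDR procedure, which additionally partitions hypotheses into groups, chooses its weights adaptively from the data, and adapts to the discreteness of the null distributions.

import numpy as np

def weighted_bh(pvals, weights, alpha=0.05):
    """Generic weighted BH step-up on weighted p-values p_i / w_i.

    Weights are renormalized to sum to m, so equal weights reduce the
    procedure to the ordinary Benjamini-Hochberg procedure.
    """
    p = np.asarray(pvals, dtype=float)
    w = np.asarray(weights, dtype=float)
    m = p.size
    w = w * m / w.sum()                      # enforce sum(w) = m
    q = p / w                                # weighted p-values
    order = np.argsort(q)
    thresh = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(q[order] <= thresh)[0]
    k = below[-1] + 1 if below.size else 0   # step-up cutoff index
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject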

 
NSF-PAR ID:
10457106
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Biometrical Journal
Volume:
62
Issue:
6
ISSN:
0323-3847
Page Range / eLocation ID:
p. 1544-1563
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Adaptive multiple testing with covariates is an important research direction that has gained major attention in recent years. It has been widely recognised that leveraging side information provided by auxiliary covariates can improve the power of false discovery rate (FDR) procedures. Currently, most such procedures are devised with p-values as their main statistics. However, for two-sided hypotheses, the usual data processing step that transforms the primary statistics, known as z-values, into p-values not only leads to a loss of information carried by the main statistics, but can also undermine the ability of the covariates to assist with the FDR inference. We develop a z-value based covariate-adaptive (ZAP) methodology that operates on the intact structural information encoded jointly by the z-values and covariates. It seeks to emulate the oracle z-value procedure via a working model, and its rejection regions significantly depart from those of the p-value adaptive testing approaches. The key strength of ZAP is that the FDR control is guaranteed with minimal assumptions, even when the working model is misspecified. We demonstrate the state-of-the-art performance of ZAP using both simulated and real data, which shows that the efficiency gain can be substantial in comparison with p-value-based methods. Our methodology is implemented in the R package zap.

     
  2. Combining SNP p-values from GWAS summary data is a promising strategy for detecting novel genetic factors. Existing statistical methods for p-value-based SNP-set testing confront two challenges. First, the statistical power of different methods depends on unknown patterns of genetic effects that could drastically vary over different SNP sets. Second, they do not identify which SNPs primarily contribute to the global association of the whole set. We propose a new signal-adaptive analysis pipeline to address these challenges using the omnibus thresholding Fisher's method (oTFisher). The oTFisher remains robustly powerful over various patterns of genetic effects. Its adaptive thresholding can be applied to estimate important SNPs contributing to the overall significance of the given SNP set. We develop efficient calculation algorithms to control the type I error rate, which accounts for the linkage disequilibrium among SNPs. Extensive simulations show that the oTFisher has robustly high power and provides a higher balanced accuracy in screening SNPs than the traditional Bonferroni and FDR procedures. We applied the oTFisher to study the genetic association of genes and haplotype blocks of the bone density-related traits using the summary data of the Genetic Factors for Osteoporosis Consortium. The oTFisher identified more novel and literature-reported genetic factors than existing p-value combination methods. Relevant computation has been implemented into the R package TFisher to support similar data analysis.
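    As a rough illustration of the thresholding idea, the sketch below computes a hard-truncated Fisher combination statistic that sums -2*log(p_i) only over SNPs with p_i <= tau. This is illustrative only; the actual oTFisher statistic, its omnibus/adaptive choice of threshold, and its null calibration under linkage disequilibrium follow the paper and the TFisher package rather than this simplified form.

    import numpy as np

    def truncated_fisher(pvals, tau=0.05):
        """Hard-truncated Fisher combination over p-values <= tau.

        With tau = 1 this is Fisher's statistic, chi-square with 2m degrees of
        freedom under independent nulls; for tau < 1 the null distribution must
        be calibrated separately (e.g. accounting for LD among SNPs).
        """
        p = np.asarray(pvals, dtype=float)
        keep = p <= tau
        stat = -2.0 * np.log(p[keep]).sum()
        return stat, keep   # `keep` flags the SNPs driving the statistic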
  3. Differential privacy provides a rigorous framework for privacy-preserving data analysis. This paper proposes the first differentially private procedure for controlling the false discovery rate (FDR) in multiple hypothesis testing. Inspired by the Benjamini-Hochberg procedure (BHq), our approach is to first repeatedly add noise to the logarithms of the p-values to ensure differential privacy and to select an approximately smallest p-value serving as a promising candidate at each iteration; the selected p-values are further supplied to the BHq and our private procedure releases only the rejected ones. Moreover, we develop a new technique that is based on a backward submartingale for proving FDR control of a broad class of multiple testing procedures, including our private procedure, and both the BHq step-up and step-down procedures. As a novel aspect, the proof works for arbitrary dependence between the true null and false null test statistics, while FDR control is maintained up to a small multiplicative factor.
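    The selection step described above can be pictured as a "report-noisy-min" over the log p-values; the toy sketch below performs a single such step. The Laplace noise scale is left as a placeholder, since the paper's noise calibration, iteration, and stopping rules are what actually provide the differential privacy and FDR guarantees.

    import numpy as np

    rng = np.random.default_rng(0)

    def noisy_min_index(log_pvals, noise_scale):
        """Add Laplace noise to each log p-value and return the index of the
        (approximately) smallest one. `noise_scale` stands in for whatever
        scale the privacy budget dictates in the actual procedure."""
        lp = np.asarray(log_pvals, dtype=float)
        noisy = lp + rng.laplace(scale=noise_scale, size=lp.size)
        return int(np.argmin(noisy))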
  4. Summary

    Many methods for estimation or control of the false discovery rate (FDR) can be improved by incorporating information about π0, the proportion of all tested null hypotheses that are true. Estimates of π0 are often based on the number of p-values that exceed a threshold λ. We first give a finite sample proof for conservative point estimation of the FDR when the λ-parameter is fixed. Then we establish a condition under which a dynamic adaptive procedure, whose λ-parameter is determined by data, will lead to conservative π0- and FDR estimators. We also present asymptotic results on simultaneous conservative FDR estimation and control for a class of dynamic adaptive procedures. Simulation results show that a novel dynamic adaptive procedure achieves more power through smaller estimation errors for π0 under independence and mild dependence conditions. We conclude by discussing the connection between estimation and control of the FDR and show that several recently developed FDR control procedures can be cast in a unifying framework where the strength of the procedures can be easily evaluated.
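    For a concrete fixed-λ example of the estimator the summary refers to, the sketch below computes the familiar estimate π0_hat = #{p_i > λ} / ((1 − λ)m) and plugs it into BH at level α/π0_hat. The choice λ = 0.5 is illustrative, and the paper's dynamic adaptive procedures instead choose λ from the data.

    import numpy as np

    def pi0_estimate(pvals, lam=0.5):
        """Fixed-lambda estimator: pi0_hat = #{p_i > lam} / ((1 - lam) * m)."""
        p = np.asarray(pvals, dtype=float)
        return min(1.0, (p > lam).sum() / ((1.0 - lam) * p.size))

    def adaptive_bh(pvals, alpha=0.05, lam=0.5):
        """Plug-in adaptive BH: run the BH step-up at level alpha / pi0_hat."""
        p = np.asarray(pvals, dtype=float)
        m = p.size
        pi0 = pi0_estimate(p, lam)
        order = np.argsort(p)
        thresh = (alpha / pi0) * np.arange(1, m + 1) / m
        below = np.nonzero(p[order] <= thresh)[0]
        k = below[-1] + 1 if below.size else 0
        reject = np.zeros(m, dtype=bool)
        reject[order[:k]] = True
        return reject, pi0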

     
    The data provided here accompany the publication "Drought Characterization with GPS: Insights into Groundwater and Reservoir Storage in California" [Young et al. (2024)], which is currently under review with Water Resources Research (as of 28 May 2024).


    Please refer to the manuscript and its supplemental materials for full details. (A link will be appended following publication)

    File formatting information is listed below, followed by a sub-section of the text describing the Geodetic Drought Index Calculation.



    The longitude, latitude, and label for grid points are provided in the file "loading_grid_lon_lat".




    Time series for each Geodetic Drought Index (GDI) time scale are provided within "GDI_time_series.zip".

    The included time scales are the 00- (daily), 1-, 3-, 6-, 12-, 18-, 24-, 36-, and 48-month GDI solutions.

    Files are formatted following...

    Title: "grid point label L****"_"time scale"_month

    File Format: ["decimal date" "GDI value"]
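    Given the two-column format above, a time series file can be read with something like the sketch below. The file name is a hypothetical example of the stated pattern, and whitespace delimiters with no header line are assumptions about the files.

    import numpy as np

    # Hypothetical name following "grid point label"_"time scale"_month,
    # e.g. the 12-month series for a grid point labelled L0001
    # (the actual labels are listed in "loading_grid_lon_lat").
    fname = "L0001_12_month"

    decimal_date, gdi = np.loadtxt(fname, unpack=True)  # columns: decimal date, GDI value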




    Gridded, epoch-by-epoch, solutions for each time scale are provided within "GDI_grids.zip".

    Files are formatted following...

    Title: GDI_"decimal date"_"time scale"_month

    File Format: ["longitude" "latitude" "GDI value" "grid point label L****"]
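    A gridded epoch file can be read in the same spirit; again the name is a hypothetical instance of the stated pattern, and the whitespace-delimited, header-less layout is an assumption.

    import pandas as pd

    # Hypothetical name following GDI_"decimal date"_"time scale"_month.
    fname = "GDI_2015.500_12_month"

    grid = pd.read_csv(fname, sep=r"\s+", header=None,
                       names=["longitude", "latitude", "GDI", "label"])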


    2.2 GEODETIC DROUGHT INDEX CALCULATION

    We develop the GDI following Vicente-Serrano et al. (2010) and Tang et al. (2023), such that the GDI mimics the derivation of the SPEI, and utilize the log-logistic distribution (further details below). While we apply hydrologic load estimates derived from GPS displacements as the input for this GDI (Figure 1a-d), we note that alternate geodetic drought indices could be derived using other types of geodetic observations, such as InSAR, gravity, strain, or a combination thereof. Therefore, the GDI is a generalizable drought index framework.

    A key benefit of the SPEI is that it is a multi-scale index, allowing the identification of droughts that occur across different time scales. For example, flash droughts (Otkin et al., 2018), which may develop over a period of a few weeks, and persistent droughts (>18 months) may not be observed or fully quantified in a uni-scale drought index framework. However, by adopting a multi-scale approach these signals can be better identified (Vicente-Serrano et al., 2010). Similarly, in the case of this GPS-based GDI, hydrologic drought signals are expected to develop at time scales that are characteristic both of the drought and of the source of the load variation (i.e., groundwater versus surface water and their respective drainage basin/aquifer characteristics). Thus, to test a range of time scales, the TWS time series are summarized with retrospective rolling-average windows of 0 (daily, with no averaging), 1, 3, 6, 12, 18, 24, 36, and 48 months in width (where one month equals 30.44 days).
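    A minimal sketch of this time-scale averaging, assuming the TWS loads are held as a daily pandas Series (the data structure is an assumption for illustration, not part of the released files):

    import numpy as np
    import pandas as pd

    # Toy daily TWS series standing in for a real grid-point time series.
    dates = pd.date_range("2006-01-01", periods=5000, freq="D")
    tws = pd.Series(np.random.randn(5000), index=dates)

    def rolling_tws(series, months):
        """Retrospective (trailing) rolling mean; 0 months = daily, no averaging."""
        if months == 0:
            return series
        window = int(round(30.44 * months))          # one month = 30.44 days
        return series.rolling(window=window, min_periods=window).mean()

    tws_scaled = {m: rolling_tws(tws, m) for m in [0, 1, 3, 6, 12, 18, 24, 36, 48]}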

    From these time-scale-averaged time series, representative compilation-window load distributions are identified for each epoch. The compilation-window distributions include all dates that fall within ±15 days of the epoch in question in each year. This allows a characterization of the estimated loads for each day relative to all past/future loads near that day, bolstering the sample size and providing more robust parametric estimates [similar to Ford et al. (2016)]; this is a key difference between our GDI derivation and that presented by Tang et al. (2023). Figure 1d illustrates the representative distribution for 01 December of each year at the grid cell co-located with GPS station P349 for the daily TWS solution. Here, all epochs between 16 November and 16 December of each year (red dots) are compiled to form the distribution presented in Figure 1e.
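    The compilation window can be sketched as below: for a given epoch, pool every value in the record whose day of year falls within ±15 days of that epoch's day of year. This is a simplified stand-in for the selection used in the study.

    import numpy as np
    import pandas as pd

    def compilation_window(series, epoch, half_width=15):
        """Values within +/- half_width days of `epoch`'s day of year, pooled
        over all years (wrapping across the 31 Dec / 1 Jan boundary)."""
        target = pd.Timestamp(epoch).dayofyear
        diff = np.abs(series.index.dayofyear - target)
        diff = np.minimum(diff, 365 - diff)          # approximate year wrap
        return series[diff <= half_width]

    # e.g. the 1 December distribution for the daily solution:
    # sample = compilation_window(tws_scaled[0], "2015-12-01")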

    This approach retains inter-annual variability in the phase and amplitude of the signal (which is largely driven by variation in the hydrologic cycle) while removing the primary annual and semi-annual signals. Solutions converge for compilation windows >±5 days, and show a minor increase in scatter of the GDI time series for windows of ±3-4 days (below which instability becomes more prevalent). To ensure robust characterization of drought characteristics, we opt for an extended ±15-day compilation window. While Tang et al. (2023) found the log-logistic distribution to be unstable and opted for a normal distribution, we find that, by using the extended compiled distribution, the solutions are stable, with negligible differences compared to the use of a normal distribution. Thus, to remain aligned with the SPEI solution, we retain the three-parameter log-logistic distribution to characterize the anomalies. Probability-weighted moments for the log-logistic distribution are calculated following Singh et al. (1993) and Vicente-Serrano et al. (2010). The individual moments are calculated following Equation 3.

    These are then used to calculate the L-moments for the shape (β), scale (α), and location (γ) parameters of the three-parameter log-logistic distribution (Equations 4–6).

    The probability density function (PDF) and the cumulative distribution function (CDF) are then calculated following Equations 7 and 8, respectively.
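    For reference, the standard three-parameter log-logistic forms used in the SPEI literature (Vicente-Serrano et al., 2010), with α the scale, β the shape, and γ the location parameter, are presumably the forms taken by Equations 7 and 8; in LaTeX notation:

    f(x) = \frac{\beta}{\alpha}\left(\frac{x-\gamma}{\alpha}\right)^{\beta-1}\left[1+\left(\frac{x-\gamma}{\alpha}\right)^{\beta}\right]^{-2}

    F(x) = \left[1+\left(\frac{\alpha}{x-\gamma}\right)^{\beta}\right]^{-1}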

    The inverse Gaussian function is used to transform the CDF from estimates of the parametric sample quantiles to standard normal index values that represent the magnitude of the standardized anomaly. Here, positive/negative values represent greater/lower than normal hydrologic storage. Thus, an index value of -1 indicates that the estimated load is approximately one standard deviation drier than the expected average load at that epoch.
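    A minimal sketch of this last step, assuming the fitted log-logistic CDF values are already in hand:

    import numpy as np
    from scipy import stats

    def gdi_from_cdf(cdf_values):
        """Map fitted CDF values (parametric sample quantiles) to standard-normal
        index values via the inverse Gaussian (probit) function; e.g. an index
        of -1 is roughly one standard deviation drier than the expected load."""
        c = np.clip(np.asarray(cdf_values, dtype=float), 1e-6, 1 - 1e-6)
        return stats.norm.ppf(c)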

    *Equations can be found in the main text.




     