The reproducibility debate has renewed interest in changing how one reports uncertainty, from a 𝑝-value for testing a null hypothesis to a confidence interval (CI) for the corresponding parameter. When CIs for multiple selected parameters are being reported, the analog of the false discovery rate (FDR) is the false coverage rate (FCR), which is the expected ratio of the number of reported CIs failing to cover their respective parameters to the total number of reported CIs. Here, we consider the general problem of FCR control in the online setting, where one encounters an infinite sequence of fixed unknown parameters ordered by time. We propose a novel solution to the problem which only requires the scientist to be able to construct marginal CIs. As special cases, our framework yields algorithms for online FDR control and online sign-classification procedures that control the false sign rate (FSR). All of our methodology applies equally well to prediction intervals, having particular implications for selective conformal inference.
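To make the FCR definition above concrete, the following minimal sketch computes the realized false coverage proportion for one batch of reported intervals; the FCR is the expectation of this quantity over repeated experiments. The function name, the closed-interval convention, and the choice to return 0 when no interval is reported are assumptions made for illustration, not part of the paper's method, and the true parameters would only be known in a simulation.

```python
from typing import Sequence, Tuple


def false_coverage_proportion(
    intervals: Sequence[Tuple[float, float]],
    true_params: Sequence[float],
) -> float:
    """Fraction of reported CIs that fail to cover their true parameter.

    Hypothetical helper: intervals are treated as closed [lo, hi], and the
    proportion is defined as 0.0 when no interval is reported.
    """
    if len(intervals) != len(true_params):
        raise ValueError("need exactly one true parameter per reported interval")
    if not intervals:
        return 0.0
    # Count the intervals whose true parameter lies outside [lo, hi].
    misses = sum(
        1
        for (lo, hi), theta in zip(intervals, true_params)
        if not (lo <= theta <= hi)
    )
    return misses / len(intervals)


# Simulated example: four reported CIs, one of which misses its parameter.
reported = [(0.1, 0.9), (1.0, 2.0), (-0.5, 0.5), (2.0, 3.0)]
truth = [0.5, 2.5, 0.0, 2.2]
print(false_coverage_proportion(reported, truth))
```

Averaging this proportion over many simulated runs gives a Monte Carlo estimate of the FCR for whatever selection and interval-construction rule produced the reported CIs.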
MixTwice: large-scale hypothesis testing for peptide arrays by variance mixing
Abstract
Summary: Peptide microarrays have emerged as a powerful technology in immunoproteomics as they provide a tool to measure the abundance of different antibodies in patient serum samples. The high dimensionality and small sample size of many experiments challenge conventional statistical approaches, including those aiming to control the false discovery rate (FDR). Motivated by limitations in reproducibility and power of current methods, we advance an empirical Bayesian tool that computes local FDR statistics and local false sign rate statistics when provided with data on estimated effects and estimated standard errors from all the measured peptides. As the name suggests, the MixTwice tool involves the estimation of two mixing distributions, one on underlying effects and one on underlying variance parameters. Constrained optimization techniques provide for model fitting of mixing distributions under weak shape constraints (unimodality of the effect distribution). Numerical experiments show that MixTwice can accurately estimate generative parameters and powerfully identify non-null peptides. In a peptide array study of rheumatoid arthritis, MixTwice recovers meaningful peptide markers in one case where the signal is weak, and has strong reproducibility properties in one case where the signal is strong.
Availability and implementation: MixTwice is available as an R software package at https://cran.r-project.org/web/packages/MixTwice/.
Supplementary information: Supplementary data are available at Bioinformatics online.
- Award ID(s): 2023239
- PAR ID: 10281606
- Editor(s): Valencia, Alfonso
- Date Published:
- Journal Name: Bioinformatics
- ISSN: 1367-4803
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
The knockoff filter is a recent false discovery rate (FDR) control method for high-dimensional linear models. We point out that knockoff has three key components: ranking algorithm, augmented design, and symmetric statistic, and each component admits multiple choices. By considering various combinations of the three components, we obtain a collection of variants of knockoff. All these variants guarantee finite-sample FDR control, and our goal is to compare their power. We assume a Rare and Weak signal model on regression coefficients and compare the power of different variants of knockoff by deriving explicit formulas of false positive rate and false negative rate. Our results provide new insights on how to improve power when controlling FDR at a targeted level. We also compare the power of knockoff with its prototype, a method that uses the same ranking algorithm but has access to an ideal threshold. The comparison reveals the additional price one pays by finding a data-driven threshold to control FDR.
-
Mura, Cameron (Ed.) Machine learning (ML) is increasingly being used to guide biological discovery in biomedicine such as prioritizing promising small molecules in drug discovery. In those applications, ML models are used to predict the properties of biological systems, and researchers use these predictions to prioritize candidates as new biological hypotheses for downstream experimental validations. However, when applied to unseen situations, these models can be overconfident and produce a large number of false positives. One solution to address this issue is to quantify the model’s prediction uncertainty and provide a set of hypotheses with a controlled false discovery rate (FDR) pre-specified by researchers. We propose CPEC, an ML framework for FDR-controlled biological discovery. We demonstrate its effectiveness using enzyme function annotation as a case study, simulating the discovery process of identifying the functions of less-characterized enzymes. CPEC integrates a deep learning model with a statistical tool known as conformal prediction, providing accurate and FDR-controlled function predictions for a given protein enzyme. Conformal prediction provides rigorous statistical guarantees to the predictive model and ensures that the expected FDR will not exceed a user-specified level with high probability. Evaluation experiments show that CPEC achieves reliable FDR control, better or comparable prediction performance at a lower FDR than existing methods, and accurate predictions for enzymes under-represented in the training data. We expect CPEC to be a useful tool for biological discovery applications where a high yield rate in validation experiments is desired but the experimental budget is limited.
-
Fu, Yan (Ed.) Advances in mass spectrometry (MS) have enabled high-throughput analysis of proteomes in biological systems. The state-of-the-art MS data analysis relies on database search algorithms to quantify proteins by identifying peptide–spectrum matches (PSMs), which convert mass spectra to peptide sequences. Different database search algorithms use distinct search strategies and thus may identify unique PSMs. However, no existing approaches can aggregate all user-specified database search algorithms with a guaranteed increase in the number of identified peptides and a control on the false discovery rate (FDR). To fill in this gap, we proposed a statistical framework, Aggregation of Peptide Identification Results (APIR), that is universally compatible with all database search algorithms. Notably, under an FDR threshold, APIR is guaranteed to identify at least as many, if not more, peptides as individual database search algorithms do. Evaluation of APIR on a complex proteomics standard dataset showed that APIR outpowers individual database search algorithms and empirically controls the FDR. Real data studies showed that APIR can identify disease-related proteins and post-translational modifications missed by some individual database search algorithms. The APIR framework is easily extendable to aggregating discoveries made by multiple algorithms in other high-throughput biomedical data analysis, e.g., differential gene expression analysis on RNA sequencing data. The APIR R package is available at https://github.com/yiling0210/APIR.
-
Comparing the spatial characteristics of spatiotemporal random fields is often in demand. However, the comparison can be challenging due to the high-dimensional feature and dependency in the data. We develop a new multiple testing approach to detect local differences in the spatial characteristics of two spatiotemporal random fields by taking the spatial information into account. Our method adopts a two-component mixture model for location-wise p-values and then derives a new false discovery rate (FDR) control, called mirror procedure, to determine the optimal rejection region. This procedure is robust to model misspecification and allows for weak dependency among hypotheses. To integrate the spatial heterogeneity, we model the mixture probability and study the benefit, if any, of allowing the alternative distribution to be spatially varying. An EM algorithm is developed to estimate the mixture model and implement the FDR procedure. We study the FDR control and the power of our new approach both theoretically and numerically, and apply the approach to compare the mean and teleconnection pattern between two synthetic climate fields. Supplementary materials for this article are available online.