
Title: Multiple Testing for IR and Recommendation System Experiments
While there has been significant research on statistical techniques for comparing two information retrieval (IR) systems, many IR experiments compare more than two systems, which can inflate the rate of false discoveries due to the multiple-comparison problem (MCP). A few IR studies have investigated multiple comparison procedures; these studies mostly use TREC data and control the familywise error rate (FWER). In this study, we extend their investigation to recommendation system evaluation data, and to multiple comparison procedures that control the false discovery rate (FDR).
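
To make the contrast concrete: an FWER procedure such as Holm bounds the probability of even one false positive, while an FDR procedure such as Benjamini-Hochberg bounds the expected proportion of false positives among discoveries, and typically admits more of them. The sketch below (our illustration, with invented p-values standing in for pairwise significance tests among systems) applies both corrections to the same data via statsmodels.

    # Illustrative only: the same p-values under FWER (Holm) vs FDR (BH) control.
    import numpy as np
    from statsmodels.stats.multitest import multipletests

    # Hypothetical p-values from pairwise comparisons among several systems.
    pvals = np.array([0.001, 0.008, 0.012, 0.041, 0.049, 0.20, 0.61])

    reject_fwer, _, _, _ = multipletests(pvals, alpha=0.05, method="holm")
    reject_fdr, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

    print("Holm (FWER) rejections:", reject_fwer.sum())  # 2 of 7
    print("BH (FDR) rejections:", reject_fdr.sum())      # 3 of 7: FDR is less strict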
Award ID(s):
2415042
NSF-PAR ID:
10497108
Author(s) / Creator(s):
Publisher / Repository:
Springer
Date Published:
Journal Name:
ECIR 2024: Advances in Information Retrieval
Volume:
14610
Page Range / eLocation ID:
449-457
Subject(s) / Keyword(s):
["recommender systems","evaluation","statistical inference","multiple comparisons"]
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Adaptive multiple testing with covariates is an important research direction that has gained major attention in recent years. It has been widely recognised that leveraging side information provided by auxiliary covariates can improve the power of false discovery rate (FDR) procedures. Currently, most such procedures are devised with p-values as their main statistics. However, for two-sided hypotheses, the usual data processing step that transforms the primary statistics, known as z-values, into p-values not only leads to a loss of information carried by the main statistics, but can also undermine the ability of the covariates to assist with the FDR inference. We develop a z-value based covariate-adaptive (ZAP) methodology that operates on the intact structural information encoded jointly by the z-values and covariates. It seeks to emulate the oracle z-value procedure via a working model, and its rejection regions significantly depart from those of the p-value based adaptive testing approaches. The key strength of ZAP is that the FDR control is guaranteed with minimal assumptions, even when the working model is misspecified. We demonstrate the state-of-the-art performance of ZAP using both simulated and real data, which shows that the efficiency gain can be substantial in comparison with p-value-based methods. Our methodology is implemented in the R package zap.
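
    To see the information loss described above, the toy sketch below (our illustration, not code from the paper or the zap package) shows that the usual two-sided transform p = 2(1 - Φ(|z|)) maps z-values of opposite sign to the same p-value, discarding directional structure that a z-value based method can still exploit.

        # Toy illustration: the two-sided z-to-p transform discards the sign.
        from scipy.stats import norm

        for z in (-2.5, 2.5):
            p = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value
            print(f"z = {z:+.1f} -> p = {p:.4f}")
        # Both signs give p ~= 0.0124: the direction of the effect is gone.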

     
  2. Abstract

    Multiple testing (MT) with false discovery rate (FDR) control has been widely conducted in the “discrete paradigm” where p-values have discrete and heterogeneous null distributions. However, in this scenario existing FDR procedures often lose some power and may yield unreliable inference, and there does not yet seem to be an FDR procedure that partitions hypotheses into groups, employs data-adaptive weights, and is nonasymptotically conservative. We propose a weighted p-value-based FDR procedure, the “weighted FDR (wFDR) procedure” for short, for MT in the discrete paradigm that efficiently adapts to both the heterogeneity and discreteness of p-value distributions. We theoretically justify the nonasymptotic conservativeness of the wFDR procedure under independence, and show via simulation studies that, for MT based on p-values of the binomial test or Fisher's exact test, it is more powerful than six other procedures. The wFDR procedure is applied to two examples based on discrete data, a drug safety study and a differential methylation study, where it makes more discoveries than two existing methods.
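
    For intuition about weighted p-value procedures in general, the sketch below implements the generic weighted Benjamini-Hochberg idea they build on (our simplified illustration, not the wFDR procedure, which further adapts to grouping and discreteness): with weights averaging one, apply BH to the reweighted values p_i / w_i, so hypotheses with larger weights face laxer thresholds.

        # Generic weighted BH sketch (illustrative; weights are hypothetical priors).
        import numpy as np

        def weighted_bh(pvals, weights, alpha=0.05):
            # BH applied to p_i / w_i; returns a boolean rejection mask.
            p = np.asarray(pvals) / np.asarray(weights)
            m = len(p)
            order = np.argsort(p)
            passed = p[order] <= alpha * np.arange(1, m + 1) / m
            k = np.nonzero(passed)[0].max() + 1 if passed.any() else 0
            reject = np.zeros(m, dtype=bool)
            reject[order[:k]] = True
            return reject

        pvals = [0.002, 0.03, 0.04, 0.3, 0.5]
        weights = [2.0, 1.0, 1.0, 0.5, 0.5]  # mean 1; favors the first hypothesis
        print(weighted_bh(pvals, weights))   # [ True False False False False]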

     
  3. Abstract

    The accelerating rate at which DNA sequence data are now generated by high-throughput sequencing instruments provides both opportunities and challenges for population genetic and ecological investigations of animals and plants. We show here how the common practice of calling genotypes from a single SNP per sequenced region ignores substantial additional information in the phased short-read sequences that are provided by these sequencing instruments. We target sequenced regions with multiple SNPs in kelp rockfish (Sebastes atrovirens) to determine “microhaplotypes” and then call these microhaplotypes as alleles at each locus. We then demonstrate how these multi-allelic marker data from such loci dramatically increase power for relationship inference. The microhaplotype approach decreases false-positive rates by several orders of magnitude, relative to calling bi-allelic SNPs, for two challenging analytical procedures, full-sibling and single parent–offspring pair identification. We also show how the identification of half-sibling pairs requires so much data that physical linkage becomes a consideration, and that most published studies that attempt to do so are dramatically underpowered. The advent of phased short-read DNA sequence data, in conjunction with emerging analytical tools for their analysis, promises to improve efficiency by reducing the number of loci necessary for a particular level of statistical confidence, thereby lowering the cost of data collection and reducing the degree of physical linkage amongst markers used for relationship estimation. Such advances will facilitate collaborative research and management for migratory and other widespread species.
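
    A minimal sketch of the microhaplotype idea itself (invented positions and alleles, for illustration only): rather than keeping a single SNP per sequenced region, the phased alleles of all SNPs in the region are concatenated into one multi-allelic "microhaplotype" allele per haplotype.

        # Toy example: build microhaplotype alleles from phased SNPs in one locus.
        haplotype_1 = {101: "A", 157: "G", 203: "T"}  # position -> allele
        haplotype_2 = {101: "A", 157: "C", 203: "C"}

        def microhap(phased_alleles):
            # Join the SNP alleles in positional order into one allele string.
            return "".join(phased_alleles[pos] for pos in sorted(phased_alleles))

        genotype = (microhap(haplotype_1), microhap(haplotype_2))
        print(genotype)  # ('AGT', 'ACC'): one multi-allelic locus, not 3 bi-allelic SNPs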

     
  4. Summary

    The familywise error rate has been widely used in genome-wide association studies. With the increasing availability of functional genomics data, it is possible to increase detection power by leveraging these genomic functional annotations. Previous efforts to accommodate covariates in multiple testing focused on false discovery rate control, while covariate-adaptive procedures controlling the familywise error rate remain underdeveloped. Here, we propose a novel covariate-adaptive procedure to control the familywise error rate that incorporates external covariates which are potentially informative of either the statistical power or the prior null probability. An efficient algorithm is developed to implement the proposed method. We prove its asymptotic validity and obtain the rate of convergence through a perturbation-type argument. Our numerical studies show that the new procedure is more powerful than competing methods and maintains robustness across different settings. We apply the proposed approach to the UK Biobank data and analyse 27 traits with 9 million single-nucleotide polymorphisms tested for associations. Seventy-five genomic annotations are used as covariates. Our approach detects more genome-wide significant loci than other methods in 21 out of the 27 traits.
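
    The simplest covariate-weighted FWER procedure, sketched below (generic weighted Bonferroni, not the paper's method), rejects H_i when p_i <= alpha * w_i / m for nonnegative weights summing to m; the union bound then gives FWER <= alpha for any fixed weights, and the craft lies in choosing the weights from covariates such as functional annotations.

        # Generic weighted Bonferroni (illustrative weights, not the paper's rule).
        import numpy as np

        def weighted_bonferroni(pvals, weights, alpha=0.05):
            p, w = np.asarray(pvals), np.asarray(weights, dtype=float)
            w = w * len(w) / w.sum()       # rescale so the weights sum to m
            return p <= alpha * w / len(p)

        pvals = [1e-9, 2e-8, 5e-7, 1e-2]
        weights = [3.0, 1.0, 0.5, 0.5]     # hypothetical annotation-based weights
        print(weighted_bonferroni(pvals, weights, alpha=5e-8))  # [ True False False False]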
  5. Earthquakes in stable salt domes are few, with a notable increase in the rate of seismicity prior to catastrophic events, such as the collapse of salt caverns used to store hydrocarbons. Cavern collapse, subsequent gas leakage, and the formation of sinkholes pose a significant hazard for local communities, given that they can disrupt normal societal functions, have various socio-economic impacts, and may result in the evacuation of residents. In Louisiana, one such event was the Bayou Corne collapse in 2012. Following reports of unusual ground tremors, we began monitoring seismicity at the Sorrento salt dome in February 2020. The goal of this study is to improve our understanding of the subsurface processes and their impact on the mechanical integrity of salt domes; we do this by examining the spatio-temporal evolution of the seismicity. We deployed a ~5 km x 4 km nodal array of 12-17 stations, with interstation distances of 0.2 km to 1.9 km, across the dome and recorded eight months of data sampled at 500 Hz. Sorrento dome events are usually low in magnitude, often with emergent P-wave onsets, as well as P-waves shrouded in the coda of preceding events during swarms. Such characteristics make the events difficult to identify using standard automatic detection and location procedures. We first evaluate current methods using an STA/LTA algorithm, coincidence event detectors, and pre-trained deep-learning detectors and pickers. We find that detecting consistent P-wave phases across several stations for the same event is challenging and poses a major problem for event association and location. To address this problem, we manually review all initial event associations to eliminate false positives that could incorrectly inflate the number of events in the catalog. We then developed a custom-trained detector and picker that outperformed the other methods and identified multiple events recorded by >70% of the stations in the array. Our approach is well-suited for identifying events with emergent P-wave onsets and short durations (~2-10 s), and it correctly identified a spike in seismicity in the days leading up to a well failure at the dome. Our methodology can be easily adapted for similar types of studies, such as volcano, mine, and dam monitoring, and geothermal exploration.
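
    For reference, a bare-bones version of the classic STA/LTA trigger mentioned above is sketched here: the ratio of a short-term to a long-term moving average of signal energy, thresholded to declare a detection. Window lengths, threshold, and the synthetic trace are illustrative; production monitoring would tune these (e.g. with ObsPy's trigger tools).

        # Minimal STA/LTA sketch on a synthetic 500 Hz trace (illustrative).
        import numpy as np

        def sta_lta(trace, n_sta, n_lta):
            # Ratio of short-term to long-term mean of squared amplitude.
            energy = np.asarray(trace, dtype=float) ** 2
            csum = np.concatenate(([0.0], np.cumsum(energy)))
            sta = (csum[n_sta:] - csum[:-n_sta]) / n_sta
            lta = (csum[n_lta:] - csum[:-n_lta]) / n_lta
            n = min(len(sta), len(lta))    # align windows ending at the same sample
            return sta[-n:] / (lta[-n:] + 1e-12)

        rng = np.random.default_rng(0)
        trace = rng.normal(0.0, 1.0, 5000)
        trace[3000:3200] += rng.normal(0.0, 8.0, 200)   # synthetic emergent event
        ratio = sta_lta(trace, n_sta=50, n_lta=1000)    # 0.1 s / 2 s at 500 Hz
        print("triggered:", bool(np.any(ratio > 4.0)))  # True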