
This content will become publicly available on July 23, 2024

Title: Inference at Scale: Significance Testing for Large Search and Recommendation Experiments
A number of information retrieval studies have assessed which statistical techniques are appropriate for comparing systems. However, these studies focus on TREC-style experiments, which typically have fewer than 100 topics. There is no similar line of work for large search and recommendation experiments; such studies typically have thousands of topics or users and much sparser relevance judgements, so it is not clear whether recommendations for analyzing traditional TREC experiments apply to these settings. In this paper, we empirically study the behavior of significance tests with large search and recommendation evaluation data. Our results show that the Wilcoxon and sign tests have significantly higher Type-1 error rates for large sample sizes than the bootstrap, randomization, and t-tests, which were more consistent with the expected error rate. While the statistical tests displayed differences in their power for smaller sample sizes, they showed no difference in their power for large sample sizes. We recommend that the sign and Wilcoxon tests not be used to analyze large-scale evaluation results. Our results demonstrate that with Top-\(N\) recommendation and large search evaluation data, most tests would have a 100% chance of finding statistically significant results. Therefore, the effect size should be used to determine practical or scientific significance.
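To illustrate the abstract's closing point, the following is a minimal sketch of a paired randomization (sign-flip) test reported alongside a standardized effect size. It is not the authors' code: the function name `paired_randomization_test`, the simulated scores, and the choice of a paired Cohen's d as the effect size are all assumptions made for illustration.

```python
import random
import statistics


def paired_randomization_test(scores_a, scores_b, n_resamples=1000, seed=0):
    """Two-sided sign-flip randomization test on per-topic differences.

    Hypothetical helper for illustration only.
    Returns (mean_difference, p_value, effect_size).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    rng = random.Random(seed)

    # Under the null hypothesis the two system labels are exchangeable within
    # each topic, so every difference is equally likely to have either sign.
    extreme = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(flipped) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)

    # Paired Cohen's d (an assumed choice of effect size):
    # mean difference divided by the standard deviation of the differences.
    sd = statistics.stdev(diffs)
    effect_size = observed / sd if sd > 0 else 0.0
    return observed, p_value, effect_size


if __name__ == "__main__":
    # Simulated data: 2000 users, with a small true improvement for system A.
    rng = random.Random(42)
    scores_b = [rng.random() for _ in range(2000)]
    scores_a = [b + 0.03 + rng.gauss(0, 0.2) for b in scores_b]
    diff, p, d = paired_randomization_test(scores_a, scores_b)
    print(f"mean diff={diff:.4f}  p={p:.4f}  effect size d={d:.3f}")
```

With thousands of users, even a small true difference like the simulated one tends to produce a very small p-value, while the effect size stays small. That gap is why the abstract recommends judging practical significance by effect size.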
Journal Name:
Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23)
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. While there has been significant research on statistical techniques for comparing two information retrieval (IR) systems, many IR experiments test more than two systems. This can lead to inflated false discoveries due to the multiple-comparison problem (MCP). A few IR studies have investigated multiple comparison procedures; these studies mostly use TREC data and control the familywise error rate. In this study, we extend their investigation to include recommendation system evaluation data as well as multiple comparison procedures that control the False Discovery Rate (FDR).
  2. Abstract. Purpose: The ability to identify the scholarship of individual authors is essential for performance evaluation. A number of factors hinder this endeavor. Common and similarly spelled surnames make it difficult to isolate the scholarship of individual authors indexed on large databases. Variations in name spelling of individual scholars further complicate matters. Common family names in scientific powerhouses like China make it problematic to distinguish between authors possessing ubiquitous and/or anglicized surnames (as well as the same or similar first names). The assignment of unique author identifiers provides a major step toward resolving these difficulties. We maintain, however, that in and of themselves, author identifiers are not sufficient to fully address the author uncertainty problem. In this study we build on the author identifier approach by considering commonalities in fielded data between authors containing the same surname and first initial of their first name. We illustrate our approach using three case studies. Design/methodology/approach: The approach we advance in this study is based on commonalities among fielded data in search results. We cast a broad initial net, i.e., a Web of Science (WOS) search for a given author’s last name, followed by a comma, followed by the first initial of his or her first name (e.g., a search for ‘John Doe’ would assume the form: ‘Doe, J’). Results for this search typically contain all of the scholarship legitimately belonging to this author in the given database (i.e., all of his or her true positives), along with a large amount of noise, or scholarship not belonging to this author (i.e., a large number of false positives). From this corpus we proceed to iteratively weed out false positives and retain true positives.
Author identifiers provide a good starting point: e.g., if ‘Doe, J’ and ‘Doe, John’ share the same author identifier, this would be sufficient for us to conclude these are one and the same individual. We find email addresses similarly adequate: e.g., if two author names which share the same surname and same first initial have an email address in common, we conclude these authors are the same person. Author identifier and email address data are not always available, however. When this occurs, other fields are used to address the author uncertainty problem. Commonalities among author data other than unique identifiers and email addresses are less conclusive for name consolidation purposes. For example, if ‘Doe, John’ and ‘Doe, J’ have an affiliation in common, do we conclude that these names belong to the same person? They may or may not; affiliations have employed two or more faculty members sharing the same last name and first initial. Similarly, it’s conceivable that two individuals with the same last name and first initial publish in the same journal, publish with the same co-authors, and/or cite the same references. Should we then ignore commonalities among these fields and conclude they’re too imprecise for name consolidation purposes? It is our position that such commonalities are indeed valuable for addressing the author uncertainty problem, but more so when used in combination. Our approach makes use of automation as well as manual inspection, relying initially on author identifiers, then commonalities among fielded data other than author identifiers, and finally manual verification. To achieve name consolidation independent of author identifier matches, we have developed a procedure that is used with bibliometric software called VantagePoint (see While the application of our technique does not exclusively depend on VantagePoint, it is the software we find most efficient in this study.
The script we developed implements our name disambiguation procedure in a way that significantly reduces manual effort on the user’s part. Those who seek to replicate our procedure independent of VantagePoint can do so by manually following the method we outline, but we note that the manual application of our procedure takes a significant amount of time and effort, especially when working with larger datasets. Our script begins by prompting the user for a surname and a first initial (for any author of interest). It then prompts the user to select a WOS field on which to consolidate author names. After this the user is prompted to point to the name of the authors field, and finally asked to identify a specific author name (referred to by the script as the primary author) within this field whom the user knows to be a true positive (a suggested approach is to point to an author name associated with one of the records that has the author’s ORCID iD or email address attached to it). The script proceeds to identify and combine all author names sharing the primary author’s surname and first initial of his or her first name who share commonalities in the WOS field on which the user was prompted to consolidate author names. This typically results in significant reduction in the initial dataset size. After the procedure completes the user is usually left with a much smaller (and more manageable) dataset to manually inspect (and/or apply additional name disambiguation techniques to). Research limitations: Match field coverage can be an issue. When field coverage is paltry, dataset reduction is not as significant, which results in more manual inspection on the user’s part. Our procedure doesn’t lend itself to scholars who have had a legal family name change (after marriage, for example).
Moreover, the technique we advance is (sometimes, but not always) likely to have a difficult time dealing with scholars who have changed careers or fields dramatically, as well as scholars whose work is highly interdisciplinary. Practical implications: The procedure we advance has the ability to save a significant amount of time and effort for individuals engaged in name disambiguation research, especially when the name under consideration is a more common family name. It is more effective when match field coverage is high and a number of match fields exist. Originality/value: Once again, the procedure we advance has the ability to save a significant amount of time and effort for individuals engaged in name disambiguation research. It combines preexisting with more recent approaches, harnessing the benefits of both. Findings: Our study applies the name disambiguation procedure we advance to three case studies. Ideal match fields are not the same for each of our case studies. We find that match field effectiveness is in large part a function of field coverage. Comparing original dataset size, the timeframe analyzed for each case study is not the same, nor are the subject areas in which they publish. Our procedure is more effective when applied to our third case study, both in terms of list reduction and 100% retention of true positives. We attribute this to excellent match field coverage, and especially in more specific match fields, as well as having a more modest/manageable number of publications. While machine learning is considered authoritative by many, we do not see it as practical or replicable. The procedure advanced herein is practical, replicable, and relatively user friendly. It might be categorized into a space between ORCID and machine learning. Machine learning approaches typically look for commonalities among citation data, which is not always available, structured or easy to work with.
The procedure we advance is intended to be applied across numerous fields in a dataset of interest (e.g. emails, coauthors, affiliations, etc.), resulting in multiple rounds of reduction. Results indicate that effective match fields include author identifiers, emails, source titles, co-authors and ISSNs. While the script we present is not likely to result in a dataset consisting solely of true positives (at least for more common surnames), it does significantly reduce manual effort on the user’s part. Dataset reduction (after our procedure is applied) is in large part a function of (a) field availability and (b) field coverage. 
  3. Scholkopf, Bernhard ; Uhler, Caroline ; Zhang, Kun (Ed.)
    In order to test if a treatment is perceptibly different from a placebo in a randomized experiment with covariates, classical nonparametric tests based on ranks of observations/residuals have been employed (e.g., by Rosenbaum), with finite-sample valid inference enabled via permutations. This paper proposes a different principle on which to base inference: if, with access to all covariates and outcomes but without access to any treatment assignments, one can form a ranking of the subjects that is sufficiently nonrandom (e.g., mostly treated followed by mostly control), then we can confidently conclude that there must be a treatment effect. Based on a more nuanced, quantifiable version of this principle, we design an interactive test called i-bet: the analyst forms a single permutation of the subjects one element at a time, and at each step the analyst bets toy money on whether that subject was actually treated or not, and learns the truth immediately after. The wealth process forms a real-valued measure of evidence against the global causal null, and we may reject the null at level \(\alpha\) if the wealth ever crosses \(1/\alpha\). Apart from providing a fresh “game-theoretic” principle on which to base the causal conclusion, the i-bet has other statistical and computational benefits, for example (A) allowing a human to adaptively design the test statistic based on increasing amounts of data being revealed (along with any working causal models and prior knowledge), and (B) not requiring permutation resampling, instead noting that under the null, the wealth forms a nonnegative martingale, and the type-1 error control of the aforementioned decision rule follows from a tight inequality by Ville. Further, if the null is not rejected, new subjects can later be added and the test can be simply continued, without any corrections (unlike with permutation p-values). Numerical experiments demonstrate good power under various heterogeneous treatment effects.
We first describe the i-bet test for two-sample comparisons with unpaired data, and then adapt it to paired data, multi-sample comparison, and sequential settings; these may be viewed as interactive martingale variants of the Wilcoxon, Kruskal-Wallis, and Friedman tests.
  4. Summary: Identifying dependency in multivariate data is a common inference task that arises in numerous applications. However, existing nonparametric independence tests typically require computation that scales at least quadratically with the sample size, making it difficult to apply them in the presence of massive sample sizes. Moreover, resampling is usually necessary to evaluate the statistical significance of the resulting test statistics at finite sample sizes, further worsening the computational burden. We introduce a scalable, resampling-free approach to testing the independence between two random vectors by breaking down the task into simple univariate tests of independence on a collection of $2\times 2$ contingency tables constructed through sequential coarse-to-fine discretization of the sample, transforming the inference task into a multiple testing problem that can be completed with almost linear complexity with respect to the sample size. To address increasing dimensionality, we introduce a coarse-to-fine sequential adaptive procedure that exploits the spatial features of dependency structures. We derive a finite-sample theory that guarantees the inferential validity of our adaptive procedure at any given sample size. We show that our approach can achieve strong control of the level of the testing procedure at any sample size without resampling or asymptotic approximation and establish its large-sample consistency. We demonstrate through an extensive simulation study its substantial computational advantage in comparison to existing approaches while achieving robust statistical power under various dependency scenarios, and illustrate how its divide-and-conquer nature can be exploited to not just test independence, but to learn the nature of the underlying dependency. Finally, we demonstrate the use of our method through analysing a dataset from a flow cytometry experiment.
  5. INTRODUCTION: In practice, the use of a whip stitch versus a locking stitch in anterior cruciate ligament (ACL) graft preparation is based on surgeon preference. Those who prefer efficiency and shorter stitch time typically choose a whip stitch, while those who require improved biomechanical properties select a locking stitch, the gold standard of which is the Krackow method. The purpose of this study was to evaluate a novel suture needle design that can be used to perform two commonly used stitch methods, a whip stitch and a locking stitch, by comparing the speed of graft preparation and biomechanical properties. It was hypothesized that adding a locking mechanism to the whip stitch would improve biomechanical performance but would also require more time to complete due to additional steps required for the locking technique. METHODS: Graft preparation was performed by four orthopaedic surgeons of different training levels, where User 1 and User 2 were both attendings and Users 3 and 4 were both fellows. A total of 24 matched pair cadaveric knees were dissected and a total of 48 semitendinosus tendons were harvested. All grafts were standardized to the same size. Tendons were randomly divided into 4 groups (12 tendons per group) such that each User performed the analogous stitch on a matched pair: Group 1, User 1 and User 3 performed whip stitches; Group 2, User 1 and User 3 performed locking stitches; Group 3, User 2 and User 4 performed whip stitches; Group 4, User 2 and User 4 performed locking stitches. For instrumentation, the two ends of tendon grafts were clamped to a preparation stand. A skin marker was used to mark five evenly spaced points, 0.5 cm apart, as a guide to create a 5-stitch series. The stitches were performed with EasyWhip, a novel two-part suture needle which allows one to do both a traditional whip stitch and a locking whip stitch, referred to as WhipLock (Figure 1). The speed for graft preparation was timed for each User.
Biomechanical testing was performed using a servohydraulic testing machine (MTS Bionix) equipped with a 5 kN load cell (Figure 2). A standardized length of tendon, 10 cm, was coupled to the MTS actuator by passing it through a cryoclamp cooled by dry ice to a temperature of -5°C. All testing samples were pre-conditioned to normalize viscoelastic effects and testing variability through application of cyclical loading to 25-100 N for three cycles. The samples were then held at 89 N for 15 minutes. Thereafter, the samples were loaded to 50-200 N for 500 cycles at 1 Hz. If samples survived, they were ramped to failure at 20 mm/min. Displacement and force data were collected throughout testing. Metrics of interest were peak-to-peak displacement (mm), stiffness (N/mm), ultimate failure load (N) and failure mode. Data are presented as averages and standard deviations. A Wilcoxon signed-rank test was used to evaluate the groups for time to complete stitch and biomechanical performance. Statistical significance was set at P = .05. RESULTS SECTION: In Group 1, the time to complete the whip stitch was not significantly different between User 1 and User 3, where the average completion time was 1 min 13 sec. Similarly, there were no differences between Users when performing the WhipLock (Group 2) with an average time of 1 min 49 sec. In Group 3 (whip stitch), User 2 took 1 min 48 sec to complete the whip stitch, whereas User 4 took 1 min 25 sec (p=.033). The time to complete the WhipLock stitch (Group 4) was significantly different, where User 2 took 3 min and 44 sec, while User 4 only took 2 min 3 sec (p=.002). Overall, the whip stitch took on average 1 min 25 sec whereas the WhipLock took 2 min 20 sec (p=.001). For whip stitch constructs, no differences were found between Users and all stitches were biomechanically equivalent. Correspondingly, for WhipLock stitches, no differences were found between Users and all suture constructs were likewise biomechanically equivalent.
Averages for peak-to-peak displacement (mm), stiffness (N/mm), and ultimate failure load (N) are presented in Table 1. Moreover, when the two stitch methods were compared, the WhipLock constructs significantly increased stiffness by 25% (p<.001), increased ultimate failure load by 35% (p<.001) and reduced peak-to-peak displacement by 55% (p=.001). The most common mode of failure for grafts with the whip stitch was suture pullout from the tendon (18/24), with a few instances of suture breakage (6/24). Tendon grafts with the WhipLock stitch commonly failed by suture breakage (22/24), with two instances of combined tendon tear and suture breakage noted. DISCUSSION: The WhipLock stitch significantly increased average construct stiffness and ultimate failure load, while significantly reducing the peak-to-peak displacement compared to the whip stitch. These added strength benefits came at a cost of 55 seconds more to complete the WhipLock stitch than the whip stitch. No statistically significant difference in biomechanical performance was found between the Users. The data suggest equivalent stitch performance regardless of the time to complete the stitch and surgeon training level. SIGNIFICANCE/CLINICAL RELEVANCE: Clinically, it is of benefit to have a suture needle device that can be used to easily perform different constructs, including one with significant strength advantages, regardless of level of experience.
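The first related item above concerns multiple comparison procedures that control the false discovery rate. As a hedged illustration, the sketch below implements the Benjamini-Hochberg step-up procedure, one widely used FDR-controlling method. The abstract does not say which procedures that study evaluates, so this choice, the function name `benjamini_hochberg`, and the example p-values are assumptions.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Benjamini-Hochberg step-up procedure (illustrative sketch).

    Returns a list of booleans aligned with `p_values`; True means the
    corresponding null hypothesis is rejected at the given FDR level.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Made-up p-values from ten hypothetical pairwise system comparisons.
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(p))
```

On these made-up p-values the procedure rejects the two smallest, whereas a Bonferroni correction at the same level (threshold 0.05 / 10 = 0.005) would reject only the first. This illustrates why FDR control is often less conservative than familywise error control.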