Title: More Than One Replication Study Is Needed for Unambiguous Tests of Replication
The problem of assessing whether experimental results can be replicated is becoming increasingly important in many areas of science. It is often assumed that assessing replication is straightforward: All one needs to do is repeat the study and see whether the results of the original and replication studies agree. This article shows that the power of the statistical test for whether two studies obtain the same effect is smaller than the power of either study to detect an effect in the first place. Thus, unless the original study and the replication study have unusually high power (e.g., power of 98%), a single replication study will not have adequate sensitivity to provide an unambiguous evaluation of replication.
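As a rough illustration of the abstract's central claim, the sketch below compares the power of a single study to detect an effect with the power of a two-study comparison test to detect a difference of the same size. It is a minimal sketch assuming standardized mean differences with known sampling variances and two-sided z-tests; the sample size, effect size, and alpha level are illustrative choices, not values from the article.

    import numpy as np
    from scipy.stats import norm

    ALPHA = 0.05
    Z_CRIT = norm.ppf(1 - ALPHA / 2)

    def power_single_study(delta, v):
        # Two-sided power of one study to detect effect delta when its
        # estimate has sampling variance v.
        z = delta / np.sqrt(v)
        return norm.cdf(z - Z_CRIT) + norm.cdf(-z - Z_CRIT)

    def power_replication_test(diff, v1, v2):
        # Power of the z-test that two studies share the same effect
        # when the true effects actually differ by diff.
        z = diff / np.sqrt(v1 + v2)
        return norm.cdf(z - Z_CRIT) + norm.cdf(-z - Z_CRIT)

    # Illustrative numbers: two studies with n = 50 per arm estimating a
    # standardized mean difference; sampling variance is roughly 2 / n.
    n, delta = 50, 0.4
    v = 2 / n
    print(f"each study vs. zero:         {power_single_study(delta, v):.2f}")
    print(f"difference of the same size: {power_replication_test(delta, v, v):.2f}")

With these numbers each study has roughly 52% power against zero, while the test for a difference between the two studies has only about 29% power, because the comparison must absorb the sampling variance of both estimates.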
Award ID(s): 1841075
PAR ID: 10173459
Author(s) / Creator(s):
Date Published:
Journal Name: Journal of Educational and Behavioral Statistics
Volume: 44
Issue: 5
ISSN: 1076-9986
Page Range / eLocation ID: 543 to 570
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract. The concept of replication is fundamental to the logic and rhetoric of science, including the argument that science is self-correcting. Yet there is very little literature on the methodology of replication. In this article, I argue that the definition of replication should not require underlying effects to be identical but should permit some variation in true effects. I note that different analyses could be used to determine whether studies replicate. Finally, I argue that a single replication study is almost never adequate to determine whether a result replicates. Thus, methodological work on the design of replication studies would be useful.
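One concrete version of the "different analyses" mentioned in the abstract above is Cochran's Q test, which asks whether a set of effect estimates is consistent with a single common effect. The sketch below implements the exact-replication form of that test; the estimates and variances are hypothetical. Under the looser definition argued for above, one would instead ask whether the between-study variation stays within a stated tolerance rather than testing it against exactly zero.

    import numpy as np
    from scipy.stats import chi2

    def cochran_q(estimates, variances):
        # Q = sum of w_i * (T_i - T_bar)^2, where T_bar is the
        # inverse-variance weighted mean; Q ~ chi-square(k - 1) when
        # all studies share one true effect.
        w = 1.0 / np.asarray(variances)
        t = np.asarray(estimates)
        t_bar = np.sum(w * t) / np.sum(w)
        q = np.sum(w * (t - t_bar) ** 2)
        return q, chi2.sf(q, df=len(t) - 1)

    # Hypothetical original study plus three replications
    q, p = cochran_q([0.45, 0.30, 0.38, 0.12], [0.04, 0.05, 0.04, 0.06])
    print(f"Q = {q:.2f}, p = {p:.3f}")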
  2. Abstract. Empirical evaluations of replication have become increasingly common, but there has been no unified approach to conducting them. Some evaluations run only a single replication study while others run several, usually across multiple laboratories. Designing such programs has largely meant contending with difficult questions about which experimental components are necessary for a set of studies to be considered replications. Another important consideration, however, is that replication studies be designed to support sufficiently sensitive analyses. For instance, if hypothesis tests are to be conducted about replication, studies should be designed to ensure those tests are well powered; otherwise, it can be difficult to determine conclusively whether replication attempts succeeded or failed. This paper describes methods for designing ensembles of replication studies so that they are both adequately sensitive and cost-efficient. It describes two potential analyses of replication studies—hypothesis tests and variance component estimation—and approaches to obtaining optimal designs for them. Using these results, it assesses the statistical power, the precision of point estimators, and the optimality of the design used by the Many Labs Project, and finds that while that design may have been sufficiently powered to detect some larger differences between studies, other designs would have been less costly and/or produced more precise estimates or higher-powered hypothesis tests.
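To make the sensitivity question above concrete, the following sketch estimates by simulation how the power of a Q-test for differences among studies grows with the number of replication studies in the ensemble. The per-study sampling variance and the between-study variance are arbitrary illustrative values, not figures from the paper or from the Many Labs Project.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(0)

    def q_test_power(k, v, tau2, alpha=0.05, reps=20_000):
        # Monte Carlo power of Cochran's Q across k studies, each with
        # sampling variance v, when true effects vary with variance tau2.
        theta = rng.normal(0.0, np.sqrt(tau2), size=(reps, k))
        est = theta + rng.normal(0.0, np.sqrt(v), size=(reps, k))
        t_bar = est.mean(axis=1, keepdims=True)  # equal weights here
        q = ((est - t_bar) ** 2).sum(axis=1) / v
        return (q > chi2.ppf(1 - alpha, df=k - 1)).mean()

    for k in (2, 5, 10, 20):
        print(f"k = {k:2d} studies: power = {q_test_power(k, v=0.04, tau2=0.02):.2f}")

Runs like this typically show that two studies give very low power against moderate heterogeneity while larger ensembles make the same test usefully sensitive, which is the design trade-off the paper formalizes.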
  3. Crowdsourcing and data mining can be used to effectively reduce the effort associated with the partial replication and enhancement of qualitative studies. For example, in a primary study, other researchers explored factors influencing the fate of GitHub pull requests using an extensive qualitative analysis of 20 pull requests. Guided by their findings, we mapped some of their qualitative insights onto quantitative questions. To determine how well their findings generalize, we collected much more data (170 additional pull requests from 142 GitHub projects). Using crowdsourcing, we augmented that data with subjective human judgments about how each pull request extended the original issue. The crowd's answers were then combined with quantitative features and used, via data mining, to build a predictor of whether code would be merged. That predictor was far more accurate than one built from the primary study's qualitative factors alone (F1 = 90% vs. 68%), illustrating the value of a mixed-methods approach and of replication for improving prior results. To test the generality of this approach, the next step in future work is to conduct other studies that extend qualitative studies with crowdsourcing and data mining.
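The pipeline described above (mined quantitative features plus crowdsourced judgments feeding a learned merge predictor, scored by F1) can be sketched as follows. Every feature, label, and model choice here is a placeholder standing in for the study's actual data and predictor, not a reconstruction of it.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import cross_val_predict

    rng = np.random.default_rng(1)
    n = 170  # pull requests, matching the size of the extended sample

    # Placeholder features: mined signals plus a coded crowd judgment of
    # how the pull request extends the original issue.
    X = np.column_stack([
        rng.poisson(3, n),       # e.g., files changed (hypothetical)
        rng.poisson(40, n),      # e.g., lines added (hypothetical)
        rng.integers(0, 3, n),   # crowd judgment, coded 0/1/2 (hypothetical)
    ])
    y = rng.integers(0, 2, n)    # merged or not (placeholder labels)

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    pred = cross_val_predict(clf, X, y, cv=5)
    print(f"F1 = {f1_score(y, pred):.2f}")  # near 0.5 on random placeholders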
  4. Data visualizations typically show a representation of a data set with little to no focus on the repeatability or generalizability of the displayed trends and patterns. However, insights gleaned from these visualizations are often used as the basis for decisions about future events, so visualizations of retrospective data often serve as "visual predictive models", an approach that can lead to invalid inferences. In this article, we describe an approach to visual model validation called Inline Replication. Inline Replication is closely related to the statistical techniques of bootstrap sampling and cross-validation and, like those methods, provides a non-parametric and broadly applicable technique for assessing the variance of findings from visualizations. This article describes the overall Inline Replication process and outlines how it can be integrated into both traditional and emerging "big data" visualization pipelines. It also provides examples of how Inline Replication can be integrated into common visualization techniques such as bar charts and linear regression lines. Results from an empirical evaluation of the technique and from two prototype Inline Replication–based visual analysis systems are also described. The empirical evaluation demonstrates the impact of Inline Replication under different conditions, showing that both (1) the level of partitioning and (2) the approach to aggregation have a major influence on its behavior. The results highlight the trade-offs in choosing Inline Replication parameters but suggest that using [Formula: see text] partitions is a reasonable default.
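A minimal sketch of the partition-and-aggregate idea behind Inline Replication appears below: split the records behind a displayed value into partitions, recompute the statistic within each partition, and report the spread alongside the full-data value. The partition count of 5 is an arbitrary stand-in, since the record above elides the paper's recommended default, and the data are synthetic.

    import numpy as np

    rng = np.random.default_rng(2)

    def inline_replication(values, n_partitions=5, stat=np.mean):
        # Randomly split the data into partitions, compute the statistic
        # within each partition, and summarize the spread across them.
        shuffled = rng.permutation(values)
        parts = np.array_split(shuffled, n_partitions)
        per_part = np.array([stat(p) for p in parts])
        return stat(values), per_part.mean(), per_part.std(ddof=1)

    # Synthetic records behind a single bar in a bar chart
    data = rng.normal(10.0, 2.0, size=500)
    full, agg, spread = inline_replication(data)
    print(f"full-data mean {full:.2f}, partition mean {agg:.2f} +/- {spread:.2f}")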
  5. Abstract. Over the past decades, bilingualism researchers have come to a consensus around a fairly strong view of nonselectivity in bilingual speakers, often citing Van Hell and Dijkstra (2002) as a critical piece of support for this position. Given the study's continuing relevance to bilingualism and its strong test of the influence of a bilingual's second language on their first language, we conducted an approximate replication of the lexical decision experiments in the original study (Experiments 2 and 3) using the same tasks and—to the extent possible—the same stimuli. Unlike the original study, our replication was conducted online with Dutch–English bilinguals (rather than in a lab with Dutch–English–French trilinguals). Despite these differences, our results closely replicated the pattern of cognate facilitation effects observed in the original study. We discuss the replicated outcomes and possible interpretations of subtle differences between them and the original results, and we make recommendations for future extensions of this line of research.