Title: Mitigating the effects of flaky tests on mutation testing
Mutation testing is widely used in research as a metric for evaluating the quality of test suites. Mutation testing runs the test suite on generated mutants (variants of the code under test), where a test suite kills a mutant if any of the tests fail when run on the mutant. Mutation testing implicitly assumes that tests exhibit deterministic behavior, in terms of their coverage and the outcome of a test (not) killing a certain mutant. Such an assumption does not hold in the presence of flaky tests, whose outcomes can non-deterministically differ even when run on the same code under test. Without reliable test outcomes, mutation testing can result in unreliable results, e.g., in our experiments, mutation scores vary by four percentage points on average between repeated executions, and 9% of mutant-test pairs have an unknown status. Many modern software projects suffer from flaky tests. We propose techniques that manage flakiness throughout the mutation testing process, largely based on strategically re-running tests. We implement our techniques by modifying the open-source mutation testing tool, PIT. Our evaluation on 30 projects shows that our techniques reduce the number of "unknown" (flaky) mutants by 79.4%.
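
To make the rerunning strategy concrete, here is a minimal sketch (in Java, since PIT targets the JVM) of how a mutation runner might resolve suspicious outcomes by strategically rerunning tests. The Mutant and TestCase types and the rerun budget are hypothetical illustrations, not PIT's API or the paper's exact algorithm.

```java
import java.util.List;

public class RerunningMutationRunner {

    enum Status { KILLED, SURVIVED, UNKNOWN }

    interface Mutant { }

    interface TestCase {
        boolean passesOn(Mutant m);  // runs this test against the given mutant
    }

    static final int RERUNS = 5;  // hypothetical rerun budget

    // Verdict of one test on one mutant. The test is only rerun when its
    // first outcome on the mutant differs from its outcome on the original
    // code, i.e., when the result looks like a kill.
    static Status verdict(TestCase test, Mutant mutant, boolean passesOnOriginal) {
        if (test.passesOn(mutant) == passesOnOriginal) {
            return Status.SURVIVED;  // outcome unchanged: no evidence of a kill
        }
        // Outcome flipped: rerun to check that the flip is reproducible
        // rather than an artifact of flakiness.
        int flips = 1;
        for (int i = 1; i < RERUNS; i++) {
            if (test.passesOn(mutant) != passesOnOriginal) {
                flips++;
            }
        }
        return flips == RERUNS ? Status.KILLED : Status.UNKNOWN;
    }

    // A mutant is killed if any test reliably kills it; it stays "unknown"
    // if the only tests that flipped behaved non-deterministically.
    static Status mutantStatus(List<TestCase> suite, Mutant mutant) {
        boolean sawUnknown = false;
        for (TestCase test : suite) {
            Status s = verdict(test, mutant, true);
            if (s == Status.KILLED) return Status.KILLED;
            if (s == Status.UNKNOWN) sawUnknown = true;
        }
        return sawUnknown ? Status.UNKNOWN : Status.SURVIVED;
    }
}
```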
Award ID(s):
1763822 1763788
NSF-PAR ID:
10104340
Author(s) / Creator(s):
Date Published:
Journal Name:
ISSTA 2019 Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis
Page Range / eLocation ID:
112 to 122
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Regression testing is increasingly important with the wide use of continuous integration. A desirable requirement for regression testing is that a test failure reliably indicates a problem in the code under test and not a false alarm from the test code or the testing infrastructure. However, some test failures are unreliable, stemming from flaky tests that can non-deterministically pass or fail for the same code under test. There are many types of flaky tests, with order-dependent tests being a prominent type. To help advance research on flaky tests, we present (1) a framework, iDFlakies, to detect and partially classify flaky tests; (2) a dataset of flaky tests in open-source projects; and (3) a study with our dataset. iDFlakies automates this experimentation for Maven-based Java projects. Using iDFlakies, we build a dataset of 422 flaky tests, 50.5% of them order-dependent and 49.5% not. Our study of these flaky tests characterizes the prevalence of the two types, the probability that a test-suite run has at least one failure due to flaky tests, and how different test reorderings affect the number of detected flaky tests. We envision that our work can spur research to alleviate the problem of flaky tests.
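
A rough illustration of the rerun-and-reorder idea (not iDFlakies' actual implementation): the sketch below flags a test as flaky once it has been seen both passing and failing on the same code, shuffling the test order between rounds. The Suite interface is a hypothetical stand-in for a Maven test run.

```java
import java.util.*;

public class OrderShuffleDetector {

    interface Suite {
        // Runs the named tests in the given order on a fixed code version;
        // returns the pass/fail outcome of each test.
        Map<String, Boolean> run(List<String> order);
    }

    // A test is flagged flaky once it has been seen both passing and
    // failing on the same code. A failure that reproduces deterministically
    // only under a particular order would then be classified as
    // order-dependent; others as non-order-dependent.
    static Set<String> detectFlaky(Suite suite, List<String> tests, int rounds) {
        Set<String> flaky = new HashSet<>();
        Map<String, Boolean> firstOutcome = new HashMap<>();
        List<String> order = new ArrayList<>(tests);
        Random random = new Random(42);  // fixed seed for repeatability
        for (int i = 0; i < rounds; i++) {
            for (Map.Entry<String, Boolean> e : suite.run(order).entrySet()) {
                Boolean previous = firstOutcome.putIfAbsent(e.getKey(), e.getValue());
                if (previous != null && !previous.equals(e.getValue())) {
                    flaky.add(e.getKey());
                }
            }
            Collections.shuffle(order, random);  // try a different order next round
        }
        return flaky;
    }
}
```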
  2. When developers make changes to their code, they typically run regression tests to detect whether their recent changes (re)introduce any bugs. However, many tests are flaky, and their outcomes can change non-deterministically, failing without apparent cause. Flaky tests are a significant nuisance in the development process, since they make it harder for developers to trust the outcome of their tests. The traditional approach to identifying flaky tests is to rerun them multiple times: if a test is observed both passing and failing on the same code, it is definitely flaky. We conducted a very large empirical study looking for flaky tests by rerunning the test suites of 24 projects 10,000 times each, and found that even with this many reruns, some flaky tests were still not detected. We propose FlakeFlagger, a novel approach that collects a set of features describing the behavior of each test and then predicts which tests are likely to be flaky based on similar behavioral features (a rough sketch of this feature-based scoring appears after the contents list below). We found that FlakeFlagger correctly labeled at least as many tests as flaky as a state-of-the-art flaky test classifier, while reporting far fewer false positives (an increase in precision from just 11% to 60%). This lower false-positive rate translates directly into saved time for researchers and developers who use the classification result to guide more expensive flaky-test detection processes. By investigating the information gain of each feature, we conclude that test execution time, overall test coverage, coverage of recently changed lines, and usage of third-party libraries are effective predictors of test flakiness. We did not find any keywords or tokens in the source code of tests that were effective in predicting flakiness, nor did we find the presence of test smells to be an effective predictor.

    This archive contains the dataset that we collected of flaky tests, along with the features that we collected from each test.

    Contents:
    Project_Info.csv: List of projects and their revisions studied
    build-logs-<project-slug>.tgz: An archive of all of the Maven build logs from each of the 10,000 runs of that project's test suite.
    failing-test-reports-<project-slug>.tgz: An archive of all of the surefire XML reports for each failing test of each build of each project.
    test_results.csv: Summary of the number of passing and failing runs for each test in each project. "Run ID" is a key into the <project-slug>.tgz archive also in this artifact, referring to the run in which we observed the test fail.
    test_features.csv: Summary of the features that each test had, as per our feature detectors described in the paper
    flakeflagger-code.zip: All scripts used to generate and process these results. These scripts are also located at https://github.com/AlshammariA/FlakeFlagger

     
  3. Mutation testing is a powerful technique for evaluating the quality of a test suite, which plays a key role in ensuring software quality. The concept of mutation testing has also been widely used in other software engineering studies, e.g., test generation, fault localization, and program repair. During mutation testing, a large number of mutants may be generated and then executed against the test suite to examine whether they can be killed, making the process extremely computationally expensive. Several techniques have been proposed to speed up this process, including selective, weakened, and predictive mutation testing. Among these, Predictive Mutation Testing (PMT) builds a classification model from a body of mutant execution records to predict whether new mutants will be killed or survive without executing them, and can achieve significant reductions in mutation cost. In PMT, each mutant is represented as a list of features related to the mutant itself and the test suite, transforming mutation testing into a binary classification problem. In this paper, we perform an extensive study of the effectiveness and efficiency of the promising PMT technique in the cross-project setting, using a total of 654 real-world projects with more than 4 million mutants. Our work also complements the original PMT work by considering more features and powerful deep learning models. The experimental results show an average prediction accuracy of over 0.85 on the 654 projects using cross-validation, demonstrating the effectiveness of PMT. Meanwhile, a clear speedup is also observed, averaging 28.7× compared to traditional mutation testing with 5 threads. In addition, we analyze the importance of different groups of features in the classification model, which provides important implications for future research.
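
The sketch below illustrates PMT's framing as binary classification over mutant feature vectors. The feature names and the Classifier interface are illustrative assumptions, not the paper's exact feature set or models.

```java
import java.util.List;

public class PredictiveMutationSketch {

    // Example mutant features in the spirit of PMT; the paper's actual
    // feature set is richer (and this study adds further features).
    static class MutantFeatures {
        int coveringTests;        // how many tests execute the mutated code
        int assertionsInTests;    // oracle strength of the covering tests
        int mutationOperatorId;   // which operator produced the mutant
        int statementDepth;       // e.g., nesting depth at the mutation point
    }

    interface Classifier {
        void train(List<MutantFeatures> features, List<Boolean> killedLabels);
        boolean predictKilled(MutantFeatures mutant);
    }

    // Cross-project PMT: train on mutants executed in other projects,
    // then predict this project's mutants without executing any of them.
    static long predictedKillCount(Classifier model, List<MutantFeatures> mutants) {
        return mutants.stream().filter(model::predictKilled).count();
    }
}
```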
  4. Flaky tests are a source of frustration and uncertainty for developers. In an educational environment, flaky tests can create doubts related to software behavior and student grades, especially when grades depend on tests passing. NC State University's junior-level software engineering course models industrial practice through team-based development and testing of new features on a large electronic health record (EHR) system, iTrust2. Students are expected to maintain and supplement an extensive suite of UI tests using Selenium WebDriver. Team builds are run on the course's continuous integration (CI) infrastructure. Students report, and we confirm, that tests that pass on one build will inexplicably fail on the next, impacting productivity and confidence in code quality and the CI system. The goal of this work is to find and fix the sources of flaky tests in iTrust2. We analyze configurations of Selenium using different underlying web browsers and timeout strategies (waits) for both test stability and runtime performance. We also consider the underlying hardware and operating systems. Our results show that HtmlUnit with Thread waits yields the fewest test failures and the best runtime on poor-performing hardware. When given more resources (e.g., more memory and a faster CPU), Google Chrome with Angular waits is less flaky and faster than HtmlUnit, especially if the browser instance is not restarted between tests. The outcomes of this research are a more stable and substantially faster teaching application and a recommendation on how to configure Selenium for applications similar to iTrust2 that run in a CI environment.
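
For readers unfamiliar with the wait strategies being compared, the sketch below contrasts a fixed Thread wait with a standard Selenium explicit wait (Angular waits, e.g., via the ngWebDriver library, additionally block until the app reports it is idle and are not shown). The URL, element id, and timeout are placeholders.

```java
import java.time.Duration;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class WaitStrategies {

    // "Thread wait": sleep for a fixed time and hope the page is ready.
    // Cheap to write, but it either wastes time or races the browser.
    static void threadWait(long millis) throws InterruptedException {
        Thread.sleep(millis);
    }

    // Explicit wait: poll until a condition holds, up to a timeout.
    static void explicitWait(WebDriver driver, String elementId) {
        new WebDriverWait(driver, Duration.ofSeconds(10))
                .until(ExpectedConditions.visibilityOfElementLocated(By.id(elementId)));
    }

    public static void main(String[] args) {
        // Reusing a single browser instance across tests, instead of
        // restarting it, was part of the faster Chrome configuration.
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://example.org/");  // placeholder URL
            explicitWait(driver, "main");        // placeholder element id
        } finally {
            driver.quit();
        }
    }
}
```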
  5. While existing methods for testing XACML policies have varying levels of effectiveness, none of them can reveal the majority of policy faults. The undisclosed faults may lead to unauthorized access and denial of service. This paper presents an approach to strong mutation testing of XACML policies that automatically generates tests from the mutants of a given policy. Such mutants represent the targeted faults that may appear in the policy. In this approach, we first compose strong mutation constraints that capture the semantic difference between each mutant and the original policy. Then, we use a constraint solver to derive an access request (i.e., a test). The test suite generated from all the mutants of a policy can achieve a perfect mutation score, thus uncovering all hypothesized faults or demonstrating their absence. Building on this mutation-based approach, the paper further explores optimal test suites that achieve a perfect mutation score without duplicate tests. To evaluate the proposed approach, our experiments included all the subject policies in the relevant literature as well as a number of new policies. The results demonstrate that: (1) it is scalable to generate a mutation-based test suite that achieves a perfect mutation score; (2) it can be impractical to generate the optimal test suite due to the expensive removal of duplicate tests; and (3) in contrast to the results of the existing study, the modified-condition/decision coverage-based method, currently the most effective one, yields low mutation scores for several policies.
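
Conceptually, a strong-mutation test is any access request on which the original policy and a mutant disagree. The hypothetical sketch below makes that condition executable, with brute-force enumeration standing in for the paper's constraint solver.

```java
import java.util.List;
import java.util.Optional;

public class XacmlMutationSketch {

    enum Decision { PERMIT, DENY, NOT_APPLICABLE }

    // A minimal stand-in for an XACML access request.
    record Request(String subject, String action, String resource) { }

    interface Policy {
        Decision evaluate(Request request);
    }

    // A request kills the mutant iff the two policies decide it differently;
    // this is the condition a strong mutation constraint encodes.
    static Optional<Request> killingRequest(Policy original, Policy mutant,
                                            List<Request> requestSpace) {
        return requestSpace.stream()
                .filter(r -> original.evaluate(r) != mutant.evaluate(r))
                .findFirst();
        // An empty result over the whole request space means the mutant is
        // equivalent, i.e., the hypothesized fault is demonstrably absent.
    }
}
```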