skip to main content


Title: iDFlakies: A Framework for Detecting and Partially Classifying Flaky Tests
Regression testing is increasingly important with the wide use of continuous integration. A desirable requirement for regression testing is that a test failure reliably indicates a problem in the code under test and not a false alarm from the test code or the testing infrastructure. However, some test failures are unreliable, stemming from flaky tests that can non- deterministically pass or fail for the same code under test. There are many types of flaky tests, with order-dependent tests being a prominent type. To help advance research on flaky tests, we present (1) a framework, iDFlakies, to detect and partially classify flaky tests; (2) a dataset of flaky tests in open-source projects; and (3) a study with our dataset. iDFlakies automates experimentation with our tool for Maven-based Java projects. Using iDFlakies, we build a dataset of 422 flaky tests, with 50.5% order-dependent and 49.5% not. Our study of these flaky tests finds the prevalence of two types of flaky tests, probability of a test-suite run to have at least one failure due to flaky tests, and how different test reorderings affect the number of detected flaky tests. We envision that our work can spur research to alleviate the problem of flaky tests.  more » « less
Award ID(s):
1839010 1763788
NSF-PAR ID:
10101224
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Proc. of the 12th IEEE International Conference on Software Testing, Verification and Validation (ICST 2019)
Page Range / eLocation ID:
312 to 322
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. When developers make changes to their code, they typically run regression tests to detect if their recent changes (re)introduce any bugs. However, many tests are flaky, and their outcomes can change non-deterministically, failing without apparent cause. Flaky tests are a significant nuisance in the development process, since they make it more difficult for developers to trust the outcome of their tests. The traditional approach to identify flaky tests is to rerun them multiple times: if a test is observed both passing and failing on the same code, it is definitely flaky. We conducted a very large empirical study looking for flaky tests by rerunning the test suites of 24 projects 10,000 times each, and found that even with this many reruns, some flaky tests were still not detected. We propose FlakeFlagger, a novel approach that collects a set of features describing the behavior of each test, and then predicts tests that are likely to be flaky based on similar behavioral features. We found that FlakeFlagger correctly labeled at least as many tests as flaky as a state-of-the-art flaky test classifier, but that FlakeFlagger reported far fewer false positives (an increase in precision from just 11% to 60%). This lower false positive rate translates directly to saved time for researchers and developers who use the classification result to guide more expensive flaky test detection processes. By investigating the information gain of each feature, we conclude that test execution time, overall test coverage, coverage of recently changed lines and usage of third party libraries are effective predictors of test flakiness. We did not find any keywords or tokens in the source code of tests that were effective in predicting test flakiness, and did not find the presence of test smells to be effective in predicting test flakiness.

    This archive contains the dataset that we collected of flaky tests, along with the features that we collected from each test.

    Contents:
    Project_Info.csv: List of projects and their revisions studied
    build-logs-<project-slug>.tgz: An archive of all of the maven build logs from each of the 10,000 runs of that project's test suite. 
    failing-test-reports-<project-slug>.tgz An archive of all of the surefire XML reports for each failing test of each build of each project.
    test_results.csv: Summary of the number of passing and failing runs for each test in each project. 
    "Run ID" is a key into the <project-slug>.tgz archive also in this artifact, which refers to the run that we observed the test fail on.
    test_features.csv: Summary of the features that each test had, as per our feature detectors described in the paper
    flakeflagger-code.zip: All scripts used to generate and process these results. These scripts are also located at https://github.com/AlshammariA/FlakeFlagger

     
    more » « less
  2. Mutation testing is widely used in research as a metric for evaluating the quality of test suites. Mutation testing runs the test suite on generated mutants (variants of the code under test), where a test suite kills a mutant if any of the tests fail when run on the mutant. Mutation testing implicitly assumes that tests exhibit deterministic behavior, in terms of their coverage and the outcome of a test (not) killing a certain mutant. Such an assumption does not hold in the presence of flaky tests, whose outcomes can non-deterministically differ even when run on the same code under test. Without reliable test outcomes, mutation testing can result in unreliable results, e.g., in our experiments, mutation scores vary by four percentage points on average between repeated executions, and 9% of mutant-test pairs have an unknown status. Many modern software projects suffer from flaky tests. We propose techniques that manage flakiness throughout the mutation testing process, largely based on strategically re-running tests. We implement our techniques by modifying the open-source mutation testing tool, PIT. Our evaluation on 30 projects shows that our techniques reduce the number of "unknown" (flaky) mutants by 79.4%. 
    more » « less
  3. This artifact contains the source code for FlakeRake, a tool for automatically reproducing timing-dependent flaky-test failures. It also includes raw and processed results produced in the evaluation of FlakeRake

     

    Contents:

     

    Timing-related APIs that FlakeRake considers adding sleeps at: timing-related-apis

    Anonymized code for FlakeRake (not runnable in its anonymized state, but included for reference; we will publicly release the non-anonymized code under an open source license pending double-blind review): flakerake.tgz

    Failure messages extracted from the FlakeFlagger dataset: 10k_reruns_failures_by_test.csv.gz 

    Output from running isolated reruns on each flaky test in the FlakeFlager dataset: 10k_isolated_reruns_all_results.csv.gz (all test results summarized into a CSV), 10k_isolated_reruns_failures_by_test.csv.gz (CSV including just test failures, including failure messages), 10k_isolated_reruns_raw_results.tgz (includes all raw results from reruns, including the XML files output by maven)

    Output from running the FlakeFlagger replication study (non-isolated 10k reruns):flakeFlaggerReplResults.csv.gz (all test results summarized into a CSV), 10k_reruns_failures_by_test.csv.gz (CSV including just failures, including failure messages), flakeFlaggerRepl_raw_results.tgz (includes all raw results from reruns, including the XML files output by maven - this file is markedly larger than the 10k isolated reruns results because we ran *all* tests in this experiment, whereas the 10k isolated rerun experiment only re-ran the tests that were known to be flaky from the FlakeFlagger dataset).

    Output from running FlakeRake on each flaky test in the FlakeFlagger dataset:

    For bisection mode: results-bis.tgz

    For one-by-one mode: results-obo.tgz

    Scripts used to execute FlakeRake using an HPC cluster: execution-scripts.tgz
    Scripts used to execute rerun experiments using an HPC cluster: flakeFlaggerReplScripts.tgz
    Scripts used to parse the "raw" maven test result XML files in this artifact into the CSV files contained in this artifact: parseSurefireXMLs.tgz 

    Output from running FlakeRake in “reproduction” mode, attempting to reproduce each of the failures that matched the FlakeFlagger dataset (collected for bisection mode only): results-repro-bis.tgz

    Analysis of timing-dependent API calls in the failure inducing configurations that matched FlakeFlagger failures: bis-sleepyline.cause-to-matched-fail-configs-found.csv

     
    more » « less
  4. Test-case prioritization (TCP) aims to detect regression bugs faster via reordering the tests run. While TCP has been studied for over 20 years, it was almost always evaluated using seeded faults/mutants as opposed to using real test failures. In this work, we study the recent change-aware information retrieval (IR) technique for TCP. Prior work has shown it performing better than traditional coverage-based TCP techniques, but it was only evaluated on a small-scale dataset with a cost-unaware metric based on seeded faults/mutants. We extend the prior work by conducting a much larger and more realistic evaluation as well as proposing enhancements that substantially improve the performance. In particular, we evaluate the original technique on a large-scale, real-world software-evolution dataset with real failures using both cost-aware and cost-unaware metrics under various configurations. Also, we design and evaluate hybrid techniques combining the IR features, historical test execution time, and test failure frequencies. Our results show that the change-aware IR technique outperforms stateof-the-art coverage-based techniques in this real-world setting, and our hybrid techniques improve even further upon the original IR technique. Moreover, we show that flaky tests have a substantial impact on evaluating the change-aware TCP techniques based on real test failures. 
    more » « less
  5. Regression testing---rerunning tests on each code version to detect newly-broken functionality---is important and widely practiced. But, regression testing is costly due to the large number of tests and the high frequency of code changes. Regression test selection (RTS) optimizes regression testing by only rerunning a subset of tests that can be affected by changes. Researchers showed that RTS based on program analysis can save substantial testing time for (medium-sized) open-source projects. Practitioners also showed that RTS based on machine learning (ML) works well on very large code repositories, e.g., in Facebook's monorepository. We combine analysis-based RTS and ML-based RTS by using the latter to choose a subset of tests selected by the former. We first train several novel ML models to learn the impact of code changes on test outcomes using a training dataset that we obtain via mutation analysis. Then, we evaluate the benefits of combining ML models with analysis-based RTS on 10 projects, compared with using each technique alone. Combining ML-based RTS with two analysis-based RTS techniques-Ekstazi and STARTS-selects 25.34% and 21.44% fewer tests, respectively. 
    more » « less