Title: Comparing and combining analysis-based and learning-based regression test selection
Regression testing---rerunning tests on each code version to detect newly broken functionality---is important and widely practiced. But regression testing is costly due to the large number of tests and the high frequency of code changes. Regression test selection (RTS) optimizes regression testing by rerunning only a subset of tests that can be affected by changes. Researchers showed that RTS based on program analysis can save substantial testing time for (medium-sized) open-source projects. Practitioners also showed that RTS based on machine learning (ML) works well on very large code repositories, e.g., in Facebook's monorepository. We combine analysis-based RTS and ML-based RTS by using the latter to choose a subset of tests selected by the former. We first train several novel ML models to learn the impact of code changes on test outcomes, using a training dataset that we obtain via mutation analysis. Then, we evaluate the benefits of combining the ML models with analysis-based RTS on 10 projects, compared with using each technique alone. Combining ML-based RTS with two analysis-based RTS techniques, Ekstazi and STARTS, selects 25.34% and 21.44% fewer tests, respectively.
Award ID(s):
2019277 2045596 1652517
NSF-PAR ID:
10366111
Date Published:
Journal Name:
IEEE/ACM International Conference on Automation of Software Test
Page Range / eLocation ID:
17 to 28
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Regression testing - rerunning tests at each code version to detect newly broken functionality - is important and widely practiced. But regression testing is costly due to the large number of tests and the high frequency of code changes. Regression test selection (RTS) optimizes regression testing by rerunning only a subset of tests that can be affected by code changes. Researchers showed that RTS based on dynamic and static program analysis can save substantial testing time for (medium-sized) open-source projects. Simultaneously, practitioners showed that RTS based on machine learning (ML) is lightweight and works well on very large software repositories, e.g., in Facebook's monorepository. We combine analysis-based RTS and ML-based RTS by using ML-based RTS to choose a subset of tests selected by analysis-based RTS. To do so, we first design several novel ML-based RTS techniques that leverage mutation analysis to obtain a training set for learning the impact of code changes on test outcomes. Then, we empirically evaluate, using 10 projects, the benefits of combining various ML models with analysis-based RTS. We also compare combining the techniques with using each technique individually. Combining ML-based RTS with two analysis-based RTS techniques, Ekstazi and STARTS, selects 25.34% and 21.44% fewer tests, respectively.
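
     As a concrete illustration of this combination, here is a minimal Python sketch of the two-stage selection the abstract describes. The helper names are assumptions, not the paper's implementation: analysis_select stands in for an analysis-based tool such as Ekstazi or STARTS, extract_features for the paper's change/test feature extraction, and model for any trained classifier with a scikit-learn-style predict_proba.

     def combined_rts(changes, all_tests, analysis_select,
                      extract_features, model, threshold=0.5):
         # Stage 1: analysis-based RTS (e.g., Ekstazi or STARTS) selects
         # candidate tests whose dependencies overlap the changed code.
         candidates = analysis_select(changes, all_tests)
         # Stage 2: the ML model scores each candidate; keep only the
         # tests it predicts are affected by the changes.
         selected = []
         for test in candidates:
             features = extract_features(changes, test)
             if model.predict_proba([features])[0][1] >= threshold:
                 selected.append(test)
         return selected

     Because stage 2 only ever removes tests from the analysis-based selection, the combination can select fewer tests than either technique alone, at the cost of inheriting any unsafety of the ML model.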
  2. Regression test selection (RTS) speeds up regression testing by re-running only tests that might be affected by code changes. Ideal RTS safely selects all affected tests and precisely selects only affected tests. But aiming for this ideal is often slower than re-running all tests. So, recent RTS techniques use program analysis to trade precision for speed, i.e., lower regression testing time, or even use machine learning to trade safety for speed. We seek to make recent analysis-based RTS techniques more precise, to further speed up regression testing. Independent studies suggest that these techniques have reached a “performance wall” in the speed-ups that they provide. We manually inspect code changes to discover those that do not require re-running tests that are only affected by such changes. We categorize 29 kinds of changes that we found in five projects into 13 findings, 11 of which are semantics-modifying. We enhance two RTS techniques, Ekstazi and STARTS, to reason about our findings. Using 1,150 versions of 23 projects, we evaluate the impact on safety and precision of leveraging such changes. We also evaluate whether our findings from a few projects can speed up regression testing in other projects. The results show that our enhancements are effective and that they generalize. On average, they result in selecting 41.7% and 31.8% fewer tests, and take 33.7% and 28.7% less time than Ekstazi and STARTS, respectively, with no loss in safety.
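
     One way to picture such an enhancement (a deliberately simplified sketch, not the paper's actual checks): before selecting the tests affected by a changed file, decide whether the change can influence test outcomes at all. The check below covers only the simplest, semantics-preserving case (comment and whitespace edits); the paper's 13 findings also identify semantics-modifying changes that are safe to ignore.

     import re

     # Java-style line comments, block comments, and all whitespace.
     _COMMENTS_OR_WS = re.compile(r"//[^\n]*|/\*.*?\*/|\s+", re.S)

     def change_requires_selection(old_source: str, new_source: str) -> bool:
         # Strip comments and whitespace; if nothing else changed, tests
         # affected only by this file need not be re-run. The paper's
         # findings go further, also skipping certain semantics-modifying
         # changes that provably cannot alter test outcomes.
         def normalize(src: str) -> str:
             return _COMMENTS_OR_WS.sub("", src)
         return normalize(old_source) != normalize(new_source)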
  3. Regression testing - running available tests after each project change - is widely practiced in industry. Despite its widespread use and importance, regression testing is a costly activity. Regression test selection (RTS) optimizes regression testing by selecting only tests affected by project changes. RTS has been extensively studied and several tools have been deployed in large projects. However, work on RTS over the last decade has mostly focused on languages with abstract computing machines (e.g., the JVM). Meanwhile, development practices (e.g., frequency of commits, testing frameworks, compilers) in C++ projects have changed dramatically, and it is unknown how RTS tools should be designed and implemented for C++ and what benefits they would provide. We present a design and implementation of an RTS technique, dubbed RTS++, that targets projects written in C++ that compile to LLVM IR and use the Google Test testing framework. RTS++ uses static analysis of a function call graph to select tests. RTS++ integrates with many existing build systems, including AutoMake, CMake, and Make. We evaluated RTS++ on 11 large open-source projects, totaling 3,811,916 lines of code. To the best of our knowledge, this is the largest evaluation of an RTS technique for C++. We measured the benefits of RTS++ compared to running all available tests (i.e., retest-all). Our results show that RTS++ reduces the number of executed tests and end-to-end testing time by 88% and 61% on average.
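
     A minimal sketch of this style of call-graph-based selection, assuming the static call graph has already been extracted (RTS++ obtains it from LLVM IR) and is represented as a plain adjacency map. The data structures and names are illustrative, not RTS++'s implementation: a test is selected if any changed function is reachable from its entry function.

     from collections import deque

     def select_tests(call_graph, tests, changed_functions):
         # call_graph: dict mapping each function to the functions it calls
         # tests: dict mapping a test name to its entry function
         # changed_functions: functions modified since the last version
         changed = set(changed_functions)
         selected = []
         for test, entry in tests.items():
             # BFS from the test's entry point; select the test as soon
             # as the traversal reaches any changed function.
             seen, queue = {entry}, deque([entry])
             while queue:
                 fn = queue.popleft()
                 if fn in changed:
                     selected.append(test)
                     break
                 for callee in call_graph.get(fn, ()):
                     if callee not in seen:
                         seen.add(callee)
                         queue.append(callee)
         return selected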
  4. When developers make changes to their code, they typically run regression tests to detect whether their recent changes (re)introduce any bugs. However, many tests are flaky: their outcomes can change non-deterministically, failing without apparent cause. Flaky tests are a significant nuisance in the development process, since they make it more difficult for developers to trust the outcome of their tests. The traditional approach to identifying flaky tests is to rerun them multiple times: if a test is observed both passing and failing on the same code, it is definitely flaky. We conducted a very large empirical study looking for flaky tests by rerunning the test suites of 24 projects 10,000 times each, and found that even with this many reruns, some flaky tests were still not detected. We propose FlakeFlagger, a novel approach that collects a set of features describing the behavior of each test, and then predicts tests that are likely to be flaky based on similar behavioral features. We found that FlakeFlagger correctly labeled as flaky at least as many tests as a state-of-the-art flaky test classifier, while reporting far fewer false positives (an increase in precision from just 11% to 60%). This lower false positive rate translates directly to time saved for researchers and developers who use the classification results to guide more expensive flaky test detection processes. By investigating the information gain of each feature, we conclude that test execution time, overall test coverage, coverage of recently changed lines, and usage of third-party libraries are effective predictors of test flakiness. We did not find any keywords or tokens in the source code of tests that were effective in predicting flakiness, nor did we find the presence of test smells to be an effective predictor.
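
     For reference, the rerun-based baseline that this study scales up fits in a few lines. This is a minimal sketch assuming per-run results are available as (test name, passed) pairs, an illustrative input format rather than FlakeFlagger's.

     def find_flaky_tests(run_results):
         # run_results: iterable of (test_name, passed) pairs aggregated
         # across many reruns of the same code version.
         outcomes = {}
         for test, passed in run_results:
             outcomes.setdefault(test, set()).add(passed)
         # A test observed both passing and failing on the same code
         # is definitely flaky.
         return sorted(t for t, seen in outcomes.items() if len(seen) == 2)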

    This archive contains the dataset of flaky tests that we collected, along with the features that we collected for each test.

    Contents:
    Project_Info.csv: List of projects and their revisions studied
    build-logs-<project-slug>.tgz: An archive of all of the Maven build logs from each of the 10,000 runs of that project's test suite.
    failing-test-reports-<project-slug>.tgz: An archive of all of the Surefire XML reports for each failing test of each build of each project.
    test_results.csv: Summary of the number of passing and failing runs for each test in each project. "Run ID" is a key into the <project-slug>.tgz archive also in this artifact, referring to the run on which we observed the test fail. (A loading sketch follows this list.)
    test_features.csv: Summary of the features that each test had, as per our feature detectors described in the paper.
    flakeflagger-code.zip: All scripts used to generate and process these results. These scripts are also available at https://github.com/AlshammariA/FlakeFlagger.
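
    A hypothetical loading example for this artifact: computing the definitely-flaky tests from test_results.csv with Python's csv module. The column names used here (test_name, num_passes, num_fails) are guesses based on the description above, not the file's actual header; adjust them to match.

    import csv

    with open("test_results.csv", newline="") as f:
        for row in csv.DictReader(f):
            passes = int(row["num_passes"])  # assumed column name
            fails = int(row["num_fails"])    # assumed column name
            if passes > 0 and fails > 0:     # both outcomes observed -> flaky
                print(row["test_name"], f"{fails}/{passes + fails} runs failed")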

     