Mutation testing is widely used in research as a metric for evaluating the quality of test suites. It runs the test suite on generated mutants (variants of the code under test); a test suite kills a mutant if any of its tests fails when run on the mutant. Mutation testing implicitly assumes that tests exhibit deterministic behavior, in terms of their coverage and the outcome of a test (not) killing a certain mutant. Such an assumption does not hold in the presence of flaky tests, whose outcomes can non-deterministically differ even when run on the same code under test. Without reliable test outcomes, mutation testing can produce unreliable results. For example, in our experiments, mutation scores vary by four percentage points on average between repeated executions, and 9% of mutant-test pairs have an unknown status. Many modern software projects suffer from flaky tests. We propose techniques that manage flakiness throughout the mutation testing process, largely based on strategically re-running tests. We implement our techniques by modifying the open-source mutation testing tool, PIT. Our evaluation on 30 projects shows that our techniques reduce the number of "unknown" (flaky) mutants by 79.4%.
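The idea of re-running tests to separate reliable outcomes from flaky ones can be illustrated with a minimal sketch. It is not the authors' technique or PIT's implementation: the function names, the fixed re-run count, and the rule that an unknown mutant counts as not killed are all assumptions made for illustration.

```python
from typing import Callable, List


def classify_pair(run_test: Callable[[], bool], reruns: int = 3) -> str:
    """Classify one mutant-test pair; run_test() returns True when the test passes.

    Hypothetical helper: a fixed number of re-runs (at least one) is assumed.
    """
    outcomes = {run_test() for _ in range(reruns)}
    if outcomes == {False}:
        return "killed"    # the test failed on every run
    if outcomes == {True}:
        return "survived"  # the test passed on every run
    return "unknown"       # runs disagreed, so the outcome is flaky


def mutant_status(pair_statuses: List[str]) -> str:
    """Combine pair statuses: a mutant is killed if any test reliably kills it."""
    if "killed" in pair_statuses:
        return "killed"
    if "unknown" in pair_statuses:
        return "unknown"
    return "survived"


def mutation_score(mutant_statuses: List[str]) -> float:
    """Fraction of mutants with status 'killed' (assumption: unknown is not killed)."""
    if not mutant_statuses:
        return 0.0
    return mutant_statuses.count("killed") / len(mutant_statuses)
```

For instance, `mutation_score(["killed", "survived", "unknown", "killed"])` returns `0.5`; whether the unknown mutant should count for or against the score is exactly the kind of ambiguity that flakiness introduces.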
An Extensive Study on Cross-Project Predictive Mutation Testing
Mutation testing is a powerful technique for evaluating the quality of a test suite, which plays a key role in ensuring software quality. The concept of mutation testing has also been widely used in other software engineering studies, e.g., test generation, fault localization, and program repair. During mutation testing, a large number of mutants may be generated and then executed against the test suite to examine whether they can be killed, making the process extremely computationally expensive. Several techniques have been proposed to speed up this process, including selective, weakened, and predictive mutation testing. Among those techniques, Predictive Mutation Testing (PMT) builds a classification model from a set of mutant execution records to predict whether new mutants would be killed or remain alive without executing them, and can achieve a significant reduction in mutation cost. In PMT, each mutant is represented as a list of features related to the mutant itself and the test suite, transforming the mutation testing problem into a binary classification problem. In this paper, we perform an extensive study on the effectiveness and efficiency of the promising PMT technique under the cross-project setting using a total of 654 real-world projects with more than 4 million mutants. Our work also complements the original PMT work by considering more features and powerful deep learning models. The experimental results show an average prediction accuracy of over 0.85 on the 654 projects using cross validation, demonstrating the effectiveness of PMT. Meanwhile, a clear speedup is also observed, averaging 28.7× compared to traditional mutation testing with 5 threads. In addition, we analyze the importance of different groups of features in the classification model, which provides important implications for future research.
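To make the binary-classification framing concrete, here is a minimal sketch in which a nearest-centroid classifier stands in for the models studied in the paper. It is not the paper's method: the two features (number of tests covering the mutated statement, number of assertions reached) and all data values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Features = Sequence[float]


def centroid(rows: List[Features]) -> List[float]:
    """Column-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train(records: List[Tuple[Features, bool]]) -> Dict[bool, List[float]]:
    """Compute one centroid per class (True = killed, False = survived).

    Assumes each class has at least one training record.
    """
    return {
        label: centroid([f for f, killed in records if killed == label])
        for label in (True, False)
    }


def predict_killed(model: Dict[bool, List[float]], features: Features) -> bool:
    """Predict 'killed' when the mutant is closer to the killed centroid."""
    def sq_dist(center: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, center))

    return sq_dist(model[True]) < sq_dist(model[False])


# Hypothetical execution records from *other* projects (cross-project setting):
# (tests covering the mutated statement, assertions reached) -> was it killed?
TRAINING = [((5, 8), True), ((7, 10), True), ((0, 0), False), ((1, 2), False)]

model = train(TRAINING)
print(predict_killed(model, (4, 6)))  # prints True: well-covered mutant
print(predict_killed(model, (1, 1)))  # prints False: barely covered mutant
```

The point of the sketch is the shape of the problem: once mutants are feature vectors with killed/survived labels, a model trained on some projects can label mutants from another project without running any tests against them.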
- Award ID(s):
- 1763906
- 10111193
- Date Published:
- Journal Name:
- IEEE Conference on Software Testing, Validation and Verification (ICST)
- Page Range / eLocation ID:
- 160 to 171
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
While the existing methods for testing XACML policies have varying levels of effectiveness, none of them can reveal the majority of policy faults. The undisclosed faults may lead to unauthorized access and denial of service. This paper presents an approach to strong mutation testing of XACML policies that automatically generates tests from the mutants of a given policy. Such mutants represent the targeted faults that may appear in the policy. In this approach, we first compose the strong mutation constraints that capture the semantic difference between each mutant and its original policy. Then, we use a constraint solver to derive an access request (i.e., test). The test suite generated from all the mutants of a policy can achieve a perfect mutation score, thus uncovering all hypothesized faults or demonstrating their absence. Based on the mutation-based approach, this paper further explores an optimal test suite that achieves a perfect mutation score without duplicate tests. To evaluate the proposed approach, our experiments have included all the subject policies in the relevant literature and used a number of new policies. The results demonstrate that: (1) it is scalable to generate a mutation-based test suite to achieve a perfect mutation score, (2) it can be impractical to generate the optimal test suite due to the expensive removal of duplicate tests, (3) in contrast to the results of the existing study, the modified-condition/decision coverage-based method, currently the most effective one, has low mutation scores for several policies.
Actor concurrency is becoming increasingly important in real-world and mission-critical software. This requires these applications to be free from actor bugs that occur in the real world and to have tests that are effective in finding these bugs. Mutation testing is a well-established technique that transforms an application to induce its likely bugs and evaluate the effectiveness of its tests in finding these bugs. Mutation testing is available for a broad spectrum of applications and their bugs, ranging from web to mobile to machine learning, and is used at scale in companies like Google and Facebook. However, there still is no mutation testing for actor concurrency that uses real-world actor bugs. In this paper, we propose 𝜇Akka, a framework for mutation testing of Akka actor concurrency using real actor bugs. Akka is a popular industrial-strength implementation of actor concurrency. To design, implement, and evaluate 𝜇Akka, we take the following major steps: (1) manually analyze a recent set of 186 real Akka bugs from Stack Overflow and GitHub to understand their causes; (2) design a set of 32 mutation operators, with 138 source code changes in Akka API, to emulate these causes and induce their bugs; (3) implement these operators in an Eclipse plugin for Java Akka; (4) use the plugin to generate 11.7k mutants of 10 real GitHub applications, with 446.4k lines of code and 7.9k tests; (5) run these tests on these mutants to measure the quality of mutants and effectiveness of tests; (6) use PIT, a popular mutation testing tool with traditional operators, to generate 26.2k mutants to compare 𝜇Akka and PIT mutant quality and test effectiveness; (7) manually analyze the bug coverage and overlap of 𝜇Akka, PIT, and actor operators in a previous work; and (8) discuss a few implications of our findings. Among others, we find that 𝜇Akka mutants are higher quality, cover more bugs, and tests are less effective in detecting them.
Several metrics have been proposed in the past to quantify the effectiveness of a test suite; they are usually based on some measure of coverage because it is sensible to quantify the effectiveness of a test suite by the extent to which it exercises (covers) various syntactic features of the program under test. Though no coverage metric has emerged as the gold standard of test suite effectiveness, mutation coverage is widely perceived as a reliable measure of test suite effectiveness because the ability of a test suite to detect program mutations can be used as an indication of its ability to detect actual faults. In this paper, we aim to challenge the superiority of mutation coverage by showing that the same test suite may have vastly different values of mutation coverage depending on the mutation operators that are used in the estimation.
Code coverage is the most widely adopted criterion for measuring test effectiveness in software quality assurance. The performance of coverage criteria (in indicating test suites' effectiveness) has been widely studied in prior work. Most of the studies use randomly constructed pseudo test suites to facilitate data collection for correlation analysis, yet no previous work has systematically studied whether pseudo test suites would lead to inflated correlation results. This paper focuses on this potentially widespread threat with a study over 123 real-world Java projects. Following the typical experimental process of studying coverage criteria, we investigate the correlation between statement/assertion coverage and mutation score using both pseudo and original test suites. In addition to direct correlation analysis, we control the number of assertions and the test suite size to conduct partial correlation analysis. The results reveal that 1) the correlation (between coverage criteria and mutation score) derived from pseudo test suites is much higher than from original test suites (from 0.21 to 0.39 higher in Kendall value); 2) contrary to previous reports, statement coverage has a stronger correlation with mutation score than assertion coverage.
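For readers unfamiliar with the Kendall value mentioned above, the following minimal sketch computes the tau-a variant of Kendall's rank correlation between two lists of measurements. The abstract does not say which variant the study uses, so tau-a (which does not adjust for ties) is an assumption here, and the coverage and mutation-score numbers are made up for illustration.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1   # both measurements rank the pair the same way
        elif product < 0:
            discordant += 1   # the measurements rank the pair in opposite ways
        # tied pairs (product == 0) count as neither
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical per-test-suite measurements, not data from the study.
statement_coverage = [0.2, 0.4, 0.6, 0.8]
mutation_scores = [0.1, 0.3, 0.2, 0.5]
tau = kendall_tau_a(statement_coverage, mutation_scores)  # 5 concordant, 1 discordant
```

With these values, five of the six pairs agree in order and one disagrees, so `tau` is (5 - 1) / 6, roughly 0.67. A value near 1 would mean that higher coverage almost always goes with a higher mutation score.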