Title: Do Pseudo Test Suites Lead to Inflated Correlation in Measuring Test Effectiveness?
Code coverage is the most widely adopted criterion for measuring test effectiveness in software quality assurance. The performance of coverage criteria (in indicating test suites' effectiveness) has been widely studied in prior work. Most of these studies use randomly constructed pseudo test suites to facilitate data collection for correlation analysis, yet no previous work has systematically studied whether pseudo test suites lead to inflated correlation results. This paper examines this potentially widespread threat with a study of 123 real-world Java projects. Following the typical experimental process of studying coverage criteria, we investigate the correlation between statement/assertion coverage and mutation score using both pseudo and original test suites. In addition to direct correlation analysis, we control for the number of assertions and the test suite size to conduct partial correlation analysis. The results reveal that 1) the correlation (between coverage criteria and mutation score) derived from pseudo test suites is much higher than that derived from original test suites (from 0.21 to 0.39 higher in Kendall value); 2) contrary to previous reports, statement coverage has a stronger correlation with mutation score than assertion coverage.
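The experimental process described in the abstract can be illustrated with a minimal sketch. It is not the authors' tooling: the toy coverage and kill data, the helper names (`pseudo_suites`, `statement_coverage`, `mutation_score`), and the first-order partial Kendall formula used to control for suite size are all assumptions made for illustration. The abstract does not say how pseudo suites were sampled or which partial correlation estimator was used.

```python
import math
import random
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b (tie-corrected) for two equal-length numeric sequences."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominator terms
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0


def partial_kendall(x, y, z):
    """First-order partial Kendall correlation of x and y, controlling for z.

    Assumption: the standard partial-correlation formula applied to tau values.
    """
    t_xy, t_xz, t_yz = kendall_tau_b(x, y), kendall_tau_b(x, z), kendall_tau_b(y, z)
    denom = math.sqrt((1 - t_xz ** 2) * (1 - t_yz ** 2))
    return (t_xy - t_xz * t_yz) / denom if denom else 0.0


def pseudo_suites(tests, count, rng):
    """Build `count` pseudo test suites as random non-empty subsets of `tests`."""
    return [rng.sample(tests, rng.randint(1, len(tests))) for _ in range(count)]


# Hypothetical toy data: test name -> (covered statements, killed mutants).
TOY = {
    "t1": ({1, 2, 3}, {"m1"}),
    "t2": ({3, 4}, {"m2", "m3"}),
    "t3": ({5, 6, 7, 8}, set()),
    "t4": ({1, 8, 9}, {"m4"}),
    "t5": ({10}, {"m1", "m5"}),
}
ALL_STATEMENTS = set().union(*(cov for cov, _ in TOY.values()))
ALL_MUTANTS = {"m1", "m2", "m3", "m4", "m5", "m6"}  # m6 is never killed


def statement_coverage(suite):
    covered = set().union(*(TOY[t][0] for t in suite))
    return len(covered) / len(ALL_STATEMENTS)


def mutation_score(suite):
    killed = set().union(*(TOY[t][1] for t in suite))
    return len(killed) / len(ALL_MUTANTS)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
suites = pseudo_suites(list(TOY), 30, rng)
coverage = [statement_coverage(s) for s in suites]
scores = [mutation_score(s) for s in suites]
sizes = [len(s) for s in suites]

print("direct tau-b:", kendall_tau_b(coverage, scores))
print("partial tau (size controlled):", partial_kendall(coverage, scores, sizes))
```

Because pseudo suites drawn from the same pool vary mostly in size, coverage and mutation score tend to rise together. Comparing the direct and size-controlled values on such samples shows the kind of effect the study examines.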
Award ID(s):
1763906
NSF-PAR ID:
10111191
Journal Name:
IEEE Conference on Software Testing, Validation and Verification (ICST)
Page Range / eLocation ID:
252 to 263
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Regression test selection (RTS) approaches reduce the cost of regression testing of evolving software systems. Existing RTS approaches based on UML models use behavioral diagrams or a combination of structural and behavioral diagrams. However, in practice, behavioral diagrams are incomplete or not used. In previous work, we proposed a fuzzy logic based RTS approach called FLiRTS that uses UML sequence and activity diagrams. In this work, we introduce FLiRTS 2, which drops the need for behavioral diagrams and relies on system models that only use UML class diagrams, which are the most widely used UML diagrams in practice. FLiRTS 2 addresses the unavailability of behavioral diagrams by classifying test cases using fuzzy logic after analyzing the information commonly provided in class diagrams. We evaluated FLiRTS 2 on UML class diagrams extracted from 3331 revisions of 13 open-source software systems, and compared the results with those of code-based dynamic (Ekstazi) and static (STARTS) RTS approaches. The average test suite reduction using FLiRTS 2 was 82.06%. The average safety violations of FLiRTS 2 with respect to Ekstazi and STARTS were 18.88% and 16.53%, respectively. FLiRTS 2 selected on average about 82% of the test cases that were selected by Ekstazi and STARTS. The average precision violations of FLiRTS 2 with respect to Ekstazi and STARTS were 13.27% and 9.01%, respectively. The average mutation score of the full test suites was 18.90%; the standard deviation of the reduced test suites from the average deviation of the mutation score for each subject was 1.78% for FLiRTS 2, 1.11% for Ekstazi, and 1.43% for STARTS. Our experiment demonstrated that the performance of FLiRTS 2 is close to that of state-of-the-art tools for code-based RTS, while requiring less information and performing the selection in less time.
  2. While the existing methods for testing XACML policies have varying levels of effectiveness, none of them can reveal the majority of policy faults. The undisclosed faults may lead to unauthorized access and denial of service. This paper presents an approach to strong mutation testing of XACML policies that automatically generates tests from the mutants of a given policy. Such mutants represent the targeted faults that may appear in the policy. In this approach, we first compose the strong mutation constraints that capture the semantic difference between each mutant and its original policy. Then, we use a constraint solver to derive an access request (i.e., test). The test suite generated from all the mutants of a policy can achieve a perfect mutation score, thus uncovering all hypothesized faults or demonstrating their absence. Based on the mutation-based approach, this paper further explores the optimal test suite, which achieves a perfect mutation score without duplicate tests. To evaluate the proposed approach, our experiments have included all the subject policies in the relevant literature and used a number of new policies. The results demonstrate that: (1) it is scalable to generate a mutation-based test suite that achieves a perfect mutation score, (2) it can be impractical to generate the optimal test suite due to the expensive removal of duplicate tests, (3) in contrast to the results of the existing study, the modified-condition/decision coverage-based method, currently the most effective one, has low mutation scores for several policies.
  3. Search-based unit test generation, if effective at fault detection, can lower the cost of testing. Such techniques rely on fitness functions to guide the search. Ultimately, such functions represent test goals that approximate (but do not ensure) fault detection. The need to rely on approximations leads to two questions: can fitness functions produce effective tests and, if so, which should be used to generate tests? To answer these questions, we have assessed the fault-detection capabilities of unit test suites generated to satisfy eight white-box fitness functions on 597 real faults from the Defects4J database. Our analysis has found that the strongest indicators of effectiveness are a high level of code coverage over the targeted class and high satisfaction of a criterion's obligations. Consequently, the branch coverage fitness function is the most effective. Our findings indicate that fitness functions that thoroughly explore system structure should be used as primary generation objectives, supported by secondary fitness functions that explore orthogonal, supporting scenarios. Our results also provide further evidence that future approaches to test generation should focus on attaining higher coverage of private code and better initialization and manipulation of class dependencies.
  4. Test adequacy criteria are widely used to guide test creation. However, many of these criteria are sensitive to statement structure or the choice of test oracle. This is because such criteria ensure that execution reaches the element of interest, but impose no constraints on the execution path after this point. We are not guaranteed to observe a failure just because a fault is triggered. To address this issue, we have proposed the concept of observability—an extension to coverage criteria based on Boolean expressions that combines the obligations of a host criterion with an additional path condition that increases the likelihood that a fault encountered will propagate to a monitored variable. Our study, conducted over five industrial systems and an additional forty open-source systems, has revealed that adding observability tends to improve efficacy over satisfaction of the traditional criteria, with average improvements of 125.98% in mutation detection with the common output-only test oracle and per-model improvements of up to 1760.52%. Ultimately, there is merit to our hypothesis—observability reduces sensitivity to the choice of oracle and to the program structure. 
  5. A number of criteria have been proposed to judge test suite adequacy. While search-based test generation has improved greatly at criteria coverage, the produced suites are still often ineffective at detecting faults. Efficacy may be limited by the single-minded application of one criterion at a time when generating suites - a sharp contrast to human testers, who simultaneously explore multiple testing strategies. We hypothesize that automated generation can be improved by selecting and simultaneously exploring multiple criteria. To address this hypothesis, we have generated multi-criteria test suites, measuring efficacy against the Defects4J fault database. We have found that multi-criteria suites can be up to 31.15% more effective at detecting complex, real-world faults than suites generated to satisfy a single criterion and 70.17% more effective than the default combination of all eight criteria. Given a fixed search budget, we recommend pairing a criterion focused on structural exploration - such as Branch Coverage - with targeted supplemental strategies aimed at the type of faults expected from the system under test. Our findings offer lessons to consider when selecting such combinations. 