skip to main content


Title: Generating Effective Test Suites by Combining Coverage Criteria
A number of criteria have been proposed to judge test suite adequacy. While search-based test generation has improved greatly at criteria coverage, the produced suites are still often ineffective at detecting faults. Efficacy may be limited by the single-minded application of one criterion at a time when generating suites - a sharp contrast to human testers, who simultaneously explore multiple testing strategies. We hypothesize that automated generation can be improved by selecting and simultaneously exploring multiple criteria. To address this hypothesis, we have generated multi-criteria test suites, measuring efficacy against the Defects4J fault database. We have found that multi-criteria suites can be up to 31.15% more effective at detecting complex, real-world faults than suites generated to satisfy a single criterion and 70.17% more effective than the default combination of all eight criteria. Given a fixed search budget, we recommend pairing a criterion focused on structural exploration - such as Branch Coverage - with targeted supplemental strategies aimed at the type of faults expected from the system under test. Our findings offer lessons to consider when selecting such combinations.  more » « less
Award ID(s):
1657299
PAR ID:
10047322
Author(s) / Creator(s):
Date Published:
Journal Name:
International Symposium on Search Based Software Engineering
Page Range / eLocation ID:
65-82
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. While adequacy criteria offer an end-point for testing, they do not mandate how targets are covered. Branch Coverage may be attained through direct calls to methods, or through indirect calls between methods. Automated generation is biased towards the rapid gains offered by indirect coverage. Therefore, even with the same end-goal, humans and automation produce very different tests. Direct coverage may yield tests that are more understandable, and that detect faults missed by traditional approaches. However, the added burden for the generation framework may result in lower coverage and faults that emerge through method interactions may be missed. To compare the two approaches, we have generated test suites for both, judging efficacy against real faults. We have found that requiring direct coverage results in lower achieved coverage and likelihood of fault detection. However, both forms of Branch Coverage cover code and detect faults that the other does not. By isolating methods, Direct Branch Coverage is less constrained in the choice of input. However, traditional Branch Coverage is able to leverage method interactions to discover faults. Ultimately, both are situationally applicable within the context of a broader testing strategy. 
    more » « less
  2. Summary

    Search‐based unit test generation, if effective at fault detection, can lower the cost of testing. Such techniques rely on fitness functions to guide the search. Ultimately, such functions represent test goals that approximate—but do not ensure—fault detection. The need to rely on approximations leads to two questions—can fitness functions produce effective tests and, if so, which should be used to generate tests?To answer these questions, we have assessed the fault‐detection capabilities of unit test suites generated to satisfy eight white‐box fitness functions on 597 real faults from the Defects4J database. Our analysis has found that the strongest indicators of effectiveness are a high level of code coverage over the targeted class and high satisfaction of a criterion's obligations. Consequently, the branch coverage fitness function is the most effective. Our findings indicate that fitness functions that thoroughly explore system structure should be used as primary generation objectives—supported by secondary fitness functions that explore orthogonal, supporting scenarios. Our results also provide further evidence that future approaches to test generation should focus on attaining higher coverage of private code and better initialization and manipulation of class dependencies.

     
    more » « less
  3. Dozens of criteria have been proposed to judge testing adequacy. Such criteria are important, as they guide automated generation efforts. Yet, the current use of such criteria in automated generation contrasts how such criteria are used by humans. For a human, coverage is part of a multifaceted combination of testing strategies. In automated generation, coverage is typically the goal, and a single fitness function is applied at one time. We propose that the key to improving the fault detection efficacy of search-based test generation approaches lies in a targeted, multifaceted approach pairing primary fitness functions that effectively explore the structure of the class under test with lightweight supporting fitness functions that target particular scenarios likely to trigger an observable failure. This report summarizes our findings to date, details the hypothesis of primary and supporting fitness functions, and identifies outstanding research challenges related to multifaceted test suite generation. We hope to inspire new advances in search-based test generation that could benefit our software-powered society. 
    more » « less
  4. Search-based test generation is guided by feedback from one or more fitness functions—scoring functions that judge solution optimality. Choosing informative fitness functions is crucial to meeting the goals of a tester. Unfortunately, many goals—such as forcing the class-under-test to throw exceptions— do not have a known fitness function formulation. We propose that meeting such goals requires treating fitness function identification as a secondary optimization step. An adaptive algorithm that can vary the selection of fitness functions could adjust its selection throughout the generation process to maximize goal attainment, based on the current population of test suites. To test this hypothesis, we have implemented two reinforcement learning algorithms in the EvoSuite framework, and used these algorithms to dynamically set the fitness functions used during generation. We have evaluated our framework, EvoSuiteFIT, on a set of 386 real faults. EvoSuiteFIT discovers and retains more exception-triggering input and produces suites that detect a variety of faults missed by the other techniques. The ability to adjust fitness functions allows EvoSuiteFIT to make strategic choices that efficiently produce more effective test suites. 
    more » « less
  5. Code coverage is the most widely adopted criteria for measuring test effectiveness in software quality assurance. The performance of coverage criteria (in indicating test suites' effectiveness) has been widely studied in prior work. Most of the studies use randomly constructed pseudo test suites to facilitate data collection for correlation analysis, yet no previous work has systematically studied whether pseudo test suites would lead to inflated correlation results. This paper focuses on the potentially wide-spread threat with a study over 123 real-world Java projects. Following the typical experimental process of studying coverage criteria, we investigate the correlation between statement/assertion coverage and mutation score using both pseudo and original test suites. Except for direct correlation analysis, we control the number of assertions and the test suite size to conduct partial correlation analysis. The results reveal that 1) the correlation (between coverage criteria and mutation score) derived from pseudo test suites is much higher than from original test suites (from 0.21 to 0.39 higher in Kendall value); 2) contrary to previously reported, statement coverage has a stronger correlation with mutation score than assertion coverage. 
    more » « less