Title: Generating Effective Test Suites by Combining Coverage Criteria
A number of criteria have been proposed to judge test suite adequacy. While search-based test generation has improved greatly at criteria coverage, the produced suites are still often ineffective at detecting faults. Efficacy may be limited by the single-minded application of one criterion at a time when generating suites - a sharp contrast to human testers, who simultaneously explore multiple testing strategies. We hypothesize that automated generation can be improved by selecting and simultaneously exploring multiple criteria. To address this hypothesis, we have generated multi-criteria test suites, measuring efficacy against the Defects4J fault database. We have found that multi-criteria suites can be up to 31.15% more effective at detecting complex, real-world faults than suites generated to satisfy a single criterion and 70.17% more effective than the default combination of all eight criteria. Given a fixed search budget, we recommend pairing a criterion focused on structural exploration - such as Branch Coverage - with targeted supplemental strategies aimed at the type of faults expected from the system under test. Our findings offer lessons to consider when selecting such combinations.
Award ID(s):
1657299
NSF-PAR ID:
10047322
Journal Name:
International Symposium on Search Based Software Engineering
Page Range / eLocation ID:
65-82
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. While adequacy criteria offer an end-point for testing, they do not mandate how targets are covered. Branch Coverage may be attained through direct calls to methods, or through indirect calls between methods. Automated generation is biased towards the rapid gains offered by indirect coverage. Therefore, even with the same end-goal, humans and automation produce very different tests. Direct coverage may yield tests that are more understandable, and that detect faults missed by traditional approaches. However, the added burden for the generation framework may result in lower coverage, and faults that emerge through method interactions may be missed. To compare the two approaches, we have generated test suites for both, judging efficacy against real faults. We have found that requiring direct coverage results in lower achieved coverage and a lower likelihood of fault detection. However, both forms of Branch Coverage cover code and detect faults that the other does not. By isolating methods, Direct Branch Coverage is less constrained in the choice of input. However, traditional Branch Coverage is able to leverage method interactions to discover faults. Ultimately, both are situationally applicable within the context of a broader testing strategy.
  2. Search‐based unit test generation, if effective at fault detection, can lower the cost of testing. Such techniques rely on fitness functions to guide the search. Ultimately, such functions represent test goals that approximate, but do not ensure, fault detection. The need to rely on approximations leads to two questions: can fitness functions produce effective tests and, if so, which should be used to generate tests? To answer these questions, we have assessed the fault‐detection capabilities of unit test suites generated to satisfy eight white‐box fitness functions on 597 real faults from the Defects4J database. Our analysis has found that the strongest indicators of effectiveness are a high level of code coverage over the targeted class and high satisfaction of a criterion's obligations. Consequently, the branch coverage fitness function is the most effective. Our findings indicate that fitness functions that thoroughly explore system structure should be used as primary generation objectives, supported by secondary fitness functions that explore orthogonal, supporting scenarios. Our results also provide further evidence that future approaches to test generation should focus on attaining higher coverage of private code and better initialization and manipulation of class dependencies.
  3. Dozens of criteria have been proposed to judge testing adequacy. Such criteria are important, as they guide automated generation efforts. Yet, the current use of such criteria in automated generation contrasts with how such criteria are used by humans. For a human, coverage is part of a multifaceted combination of testing strategies. In automated generation, coverage is typically the goal, and a single fitness function is applied at one time. We propose that the key to improving the fault detection efficacy of search-based test generation approaches lies in a targeted, multifaceted approach pairing primary fitness functions that effectively explore the structure of the class under test with lightweight supporting fitness functions that target particular scenarios likely to trigger an observable failure. This report summarizes our findings to date, details the hypothesis of primary and supporting fitness functions, and identifies outstanding research challenges related to multifaceted test suite generation. We hope to inspire new advances in search-based test generation that could benefit our software-powered society.
  4. Search-based test generation is guided by feedback from one or more fitness functions, scoring functions that judge solution optimality. Choosing informative fitness functions is crucial to meeting the goals of a tester. Unfortunately, many goals, such as forcing the class-under-test to throw exceptions, do not have a known fitness function formulation. We propose that meeting such goals requires treating fitness function identification as a secondary optimization step. An adaptive algorithm that can vary the selection of fitness functions could adjust its selection throughout the generation process to maximize goal attainment, based on the current population of test suites. To test this hypothesis, we have implemented two reinforcement learning algorithms in the EvoSuite framework, and used these algorithms to dynamically set the fitness functions used during generation. We have evaluated our framework, EvoSuiteFIT, on a set of 386 real faults. EvoSuiteFIT discovers and retains more exception-triggering input and produces suites that detect a variety of faults missed by the other techniques. The ability to adjust fitness functions allows EvoSuiteFIT to make strategic choices that efficiently produce more effective test suites.
  5. Background: Post-operative delirium is a common complication among adult patients in the intensive care unit. Current literature does not support the use of pharmacologic measures to manage this condition, and several studies explore the potential for the use of non-pharmacologic methods such as early mobility plans or environmental modifications. The aim of this systematic review is to examine and report on recently available literature evaluating the relationship between non-pharmacologic management strategies and the reduction of delirium in the intensive care unit. Methods: Six major research databases were systematically searched for articles analyzing the efficacy of non-pharmacologic delirium interventions in the past five years. Search results were restricted to adult human patients aged 18 years or older in the intensive care unit setting, excluding terminally ill subjects and withdrawal-related delirium. Following title, abstract, and full text review, 27 articles fulfilled the inclusion criteria and are included in this report. Results: The 27 reviewed articles consist of 12 interventions with a single-component investigational approach, and 15 with multi-component bundled protocols. Delirium incidence was the most commonly assessed outcome followed by duration. Family visitation was the most effective individual intervention while mobility interventions were the least effective. Two of the three family studies significantly reduced delirium incidence, while one in five mobility studies did the same. Multi-component bundle approaches were the most effective of all; of the reviewed studies, eight of 11 bundles significantly improved delirium incidence and seven of eight bundles decreased the duration of delirium. Conclusions: Multi-component, bundled interventions were more effective at managing intensive care unit delirium than those utilizing an approach with a single interventional element. Although better management of this condition suggests a decrease in resource burden and improvement in patient outcomes, comparative research should be performed to identify the importance of specific bundle elements.