Regression testing is increasingly important with the wide use of continuous integration. A desirable requirement for regression testing is that a test failure reliably indicates a problem in the code under test and not a false alarm from the test code or the testing infrastructure. However, some test failures are unreliable, stemming from flaky tests that can non-deterministically pass or fail for the same code under test. There are many types of flaky tests, with order-dependent tests being a prominent type. To help advance research on flaky tests, we present (1) a framework, iDFlakies, to detect and partially classify flaky tests; (2) a dataset of flaky tests in open-source projects; and (3) a study with our dataset. iDFlakies automates experimentation with our tool for Maven-based Java projects. Using iDFlakies, we build a dataset of 422 flaky tests, with 50.5% order-dependent and 49.5% not. Our study of these flaky tests finds the prevalence of two types of flaky tests, the probability that a test-suite run has at least one failure due to flaky tests, and how different test reorderings affect the number of detected flaky tests. We envision that our work can spur research to alleviate the problem of flaky …
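To make the idea of detecting flaky tests through reorderings concrete, here is a minimal sketch in Python. It is not iDFlakies itself, which targets Maven-based Java projects. The toy tests, the shared `cache`, and the function names `run_in_order` and `find_order_sensitive` are all hypothetical, and the choice of one reversed order plus a few random shuffles is an assumption made for illustration.

```python
import random
from typing import Callable, Dict, List, Set

# Hypothetical shared state that the toy tests read and write.
cache: Dict[str, str] = {}


def test_writes() -> None:
    cache["user"] = "alice"


def test_reads() -> None:
    # Passes only if test_writes ran earlier in the same suite run.
    assert cache.get("user") == "alice"


def test_independent() -> None:
    assert 1 + 1 == 2


def run_in_order(tests: Dict[str, Callable[[], None]],
                 order: List[str],
                 reset: Callable[[], None]) -> Dict[str, bool]:
    """Reset the shared state, run the tests in `order`, return name -> passed."""
    reset()
    results: Dict[str, bool] = {}
    for name in order:
        try:
            tests[name]()
            results[name] = True
        except AssertionError:
            results[name] = False
    return results


def find_order_sensitive(tests: Dict[str, Callable[[], None]],
                         reset: Callable[[], None],
                         rounds: int = 10,
                         seed: int = 0) -> Set[str]:
    """Flag tests whose outcome in some reordering differs from the original order."""
    rng = random.Random(seed)
    original = list(tests)
    baseline = run_in_order(tests, original, reset)
    # Assumed reordering strategy: the reversed order plus random shuffles.
    orders = [original[::-1]]
    for _ in range(rounds):
        shuffled = original[:]
        rng.shuffle(shuffled)
        orders.append(shuffled)
    flagged: Set[str] = set()
    for order in orders:
        outcome = run_in_order(tests, order, reset)
        flagged.update(n for n in original if outcome[n] != baseline[n])
    return flagged


SUITE = {
    "test_writes": test_writes,
    "test_reads": test_reads,
    "test_independent": test_independent,
}
```

Calling `find_order_sensitive(SUITE, cache.clear)` flags only `test_reads` here. The sketch does not separate order-dependent tests from tests that are flaky for other reasons; telling the two apart would need further re-runs of the same order.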
Finding Polluter Tests Using Java PathFinder
Tests that modify (i.e., "pollute") the state shared among tests in a test suite are called "polluter tests". Finding these tests is important because they could result in different test outcomes based on the order of the tests in the test suite. Prior work has proposed the PolDet technique for finding polluter tests in runs of JUnit tests on a regular Java Virtual Machine (JVM). Given that Java PathFinder (JPF) provides desirable infrastructure support, such as systematically exploring thread schedules, it is a worthwhile attempt to re-implement techniques such as PolDet in JPF. We present a new implementation of PolDet for finding polluter tests in runs of JUnit tests in JPF. We customize the existing state comparison in JPF to support the so-called "common-root isomorphism" required by PolDet. We find that our implementation is simple, requiring only ~200 lines of code, demonstrating that JPF is a sophisticated infrastructure for rapid exploration of research ideas on software testing. We evaluate our implementation on 187 test classes from 13 Java projects and find 26 polluter tests. Our results show that the runtime overhead of PolDet@JPF compared to base JPF is relatively low, on average 1.43x. However, our experiments …
- Award ID(s): 1816615
- Publication Date:
- NSF-PAR ID: 10299902
- Journal Name: ACM SIGSOFT Software Engineering Notes
- Volume: 46
- Issue: 3
- Page Range or eLocation-ID: 37 to 41
- ISSN: 0163-5948
- Sponsoring Org: National Science Foundation
More Like this
Mutation testing is widely used in research as a metric for evaluating the quality of test suites. Mutation testing runs the test suite on generated mutants (variants of the code under test), where a test suite kills a mutant if any of the tests fail when run on the mutant. Mutation testing implicitly assumes that tests exhibit deterministic behavior, in terms of their coverage and the outcome of a test (not) killing a certain mutant. Such an assumption does not hold in the presence of flaky tests, whose outcomes can non-deterministically differ even when run on the same code under test. Without reliable test outcomes, mutation testing can result in unreliable results, e.g., in our experiments, mutation scores vary by four percentage points on average between repeated executions, and 9% of mutant-test pairs have an unknown status. Many modern software projects suffer from flaky tests. We propose techniques that manage flakiness throughout the mutation testing process, largely based on strategically re-running tests. We implement our techniques by modifying the open-source mutation testing tool, PIT. Our evaluation on 30 projects shows that our techniques reduce the number of "unknown" (flaky) mutants by 79.4%.
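As a rough illustration of how re-running can expose unreliable mutant-test outcomes, the Python sketch below assigns a status to a mutant-test pair from repeated runs. It is not PIT's API and not the authors' exact technique. The function names, the fixed number of re-runs, and the rule that mixed outcomes yield "unknown" are assumptions made for this example.

```python
import itertools
from typing import Callable, List


def pair_status(run_test_on_mutant: Callable[[], bool], reruns: int = 5) -> str:
    """Re-run one test on one mutant; the callable returns True when the test passes."""
    outcomes = {run_test_on_mutant() for _ in range(reruns)}
    if outcomes == {False}:
        return "killed"    # the test failed on the mutant in every run
    if outcomes == {True}:
        return "survived"  # the test passed on the mutant in every run
    return "unknown"       # mixed outcomes: flakiness hides the true status


def mutant_status(pair_statuses: List[str]) -> str:
    """A mutant is killed if any test reliably kills it."""
    if "killed" in pair_statuses:
        return "killed"
    if "unknown" in pair_statuses:
        return "unknown"
    return "survived"


# Hypothetical flaky test simulated by alternating pass/fail outcomes.
_alternating = itertools.cycle([True, False])


def flaky_run() -> bool:
    return next(_alternating)
```

For example, `pair_status(flaky_run)` yields "unknown", while a test that always fails on the mutant yields "killed". A real tool would also need to decide how many re-runs are worth their cost, which this sketch leaves as a fixed parameter.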
The compilation scheme for Volatile accesses in the OpenJDK 9 HotSpot Java Virtual Machine has a major problem that persists despite a recent bug report and a long discussion. One of the suggested fixes is to let Java compile Volatile accesses in the same way as C/C++11. However, we show that this approach is invalid for Java. Indeed, we show a set of optimizations that is valid for C/C++11 but invalid for Java, while the compilation scheme is similar. We prove the correctness of the compilation scheme to Power and x86 and a suite of valid optimizations in Java. Our proofs are based on a language model that we validate by proving key properties such as the DRF-SC theorem and by running litmus tests via our implementation of Java in Herd7.
Flaky tests are a source of frustration and uncertainty for developers. In an educational environment, flaky tests can create doubts related to software behavior and student grades, especially when the grades depend on tests passing. NC State University's junior-level software engineering course models industrial practice through team-based development and testing of new features on a large electronic health record (EHR) system, iTrust2. Students are expected to maintain and supplement an extensive suite of UI tests using Selenium WebDriver. Team builds are run on the course's continuous integration (CI) infrastructure. Students report, and we confirm, that tests that pass on one build will inexplicably fail on the next, impacting productivity and confidence in code quality and the CI system. The goal of this work is to find and fix the sources of flaky tests in iTrust2. We analyze configurations of Selenium using different underlying web browsers and timeout strategies (waits) for both test stability and runtime performance. We also consider underlying hardware and operating systems. Our results show that HtmlUnit with Thread waits provides the lowest number of test failures and best runtime on poor-performing hardware. When given more resources (e.g., more memory and a faster CPU), Google Chrome with Angular …
Obeid, I.; Selesnik, I.; Picone, J. (Ed.) The Neuronix high-performance computing cluster allows us to conduct extensive machine learning experiments on big data [1]. This heterogeneous cluster uses innovative scheduling technology, Slurm [2], that manages a network of CPUs and graphics processing units (GPUs). The GPU farm consists of a variety of processors ranging from low-end consumer grade devices such as the Nvidia GTX 970 to higher-end devices such as the GeForce RTX 2080. These GPUs are essential to our research since they allow extremely compute-intensive deep learning tasks to be executed on massive data resources such as the TUH EEG Corpus [2]. We use TensorFlow [3] as the core machine learning library for our deep learning systems, and routinely employ multiple GPUs to accelerate the training process. Reproducible results are essential to machine learning research. Reproducibility in this context means the ability to replicate an existing experiment – performance metrics such as error rates should be identical and floating-point calculations should match closely. Three examples of ways we typically expect an experiment to be replicable are: (1) The same job run on the same processor should produce the same results each time it is run. (2) A job run on a CPU and GPU should produce …