skip to main content

Title: Fairness Testing: Testing Software for Discrimination
This paper defines the notions of software fairness and discrimination and develops a testing-based method for measuring if and how much software discriminates. Specifically, the paper focuses on measuring causality in discriminatory behavior. Modern software contributes to important societal decisions and evidence of software discrimination has been found in systems that recommend criminal sentences, grant access to financial loans and products, and determine who is allowed to participate in promotions and receive services. Our approach, Themis, measures discrimination in software by generating efficient, discrimination-testing test suites. Given a schema describing valid system inputs, Themis generates discrimination tests automatically and, notably, does not require an oracle. We evaluate Themis on 20 software systems, 12 of which come from prior work with explicit focus on avoiding discrimination. We find that (1) Themis is effective at discovering software discrimination, (2) state-of-the-art techniques for removing discrimination from algorithms fail in many situations, at times discriminating against as much as 98% of an input subdomain, (3) Themis optimizations are effective at producing efficient test suites for measuring discrimination, and (4) Themis is more efficient on systems that exhibit more discrimination. We thus demonstrate that fairness testing is a critical aspect of the software development cycle more » in domains with possible discrimination and provide initial tools for measuring software discrimination. « less
Authors:
; ;
Award ID(s):
1453543 1744471
Publication Date:
NSF-PAR ID:
10033437
Journal Name:
Proceedings of 2017 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering
Page Range or eLocation-ID:
498–510
Sponsoring Org:
National Science Foundation
More Like this
  1. Bias in decisions made by modern software is becoming a common and serious problem. We present Themis, an automated test suite generator to measure two types of discrimination, including causal relationships between sensitive inputs and program behavior. We explain how Themis can measure discrimination and aid its debugging, describe a set of optimizations Themis uses to reduce test suite size, and demonstrate Themis' effectiveness on open-source software. Themis is open-source and all our evaluation data are available at http://fairness.cs.umass.edu/. See a video of Themis in action: https://youtu.be/brB8wkaUesY
  2. A number of criteria have been proposed to judge test suite adequacy. While search-based test generation has improved greatly at criteria coverage, the produced suites are still often ineffective at detecting faults. Efficacy may be limited by the single-minded application of one criterion at a time when generating suites - a sharp contrast to human testers, who simultaneously explore multiple testing strategies. We hypothesize that automated generation can be improved by selecting and simultaneously exploring multiple criteria. To address this hypothesis, we have generated multi-criteria test suites, measuring efficacy against the Defects4J fault database. We have found that multi-criteria suitesmore »can be up to 31.15% more effective at detecting complex, real-world faults than suites generated to satisfy a single criterion and 70.17% more effective than the default combination of all eight criteria. Given a fixed search budget, we recommend pairing a criterion focused on structural exploration - such as Branch Coverage - with targeted supplemental strategies aimed at the type of faults expected from the system under test. Our findings offer lessons to consider when selecting such combinations.« less
  3. Code coverage is the most widely adopted criteria for measuring test effectiveness in software quality assurance. The performance of coverage criteria (in indicating test suites' effectiveness) has been widely studied in prior work. Most of the studies use randomly constructed pseudo test suites to facilitate data collection for correlation analysis, yet no previous work has systematically studied whether pseudo test suites would lead to inflated correlation results. This paper focuses on the potentially wide-spread threat with a study over 123 real-world Java projects. Following the typical experimental process of studying coverage criteria, we investigate the correlation between statement/assertion coverage andmore »mutation score using both pseudo and original test suites. Except for direct correlation analysis, we control the number of assertions and the test suite size to conduct partial correlation analysis. The results reveal that 1) the correlation (between coverage criteria and mutation score) derived from pseudo test suites is much higher than from original test suites (from 0.21 to 0.39 higher in Kendall value); 2) contrary to previously reported, statement coverage has a stronger correlation with mutation score than assertion coverage.« less
  4. In modern software development, software libraries play a crucial role in reducing software development effort and improving software quality. However, at the same time, the asynchronous upgrades of software libraries and client software projects often result in incompatibilities between different versions of libraries and client projects. When libraries evolve, it is often very challenging for library developers to maintain the so-called backward compatibility and keep all their external behavior untouched, and behavioral backward incompatibilities (BBIs) may occur. In practice, the regression test suites of library projects often fail to detect all BBIs. Therefore, in this paper, we propose DeBBI tomore »detect BBIs via cross-project testing and analysis, i.e., using the test suites of various client projects to detect library BBIs. Since executing all the possible client projects can be extremely time consuming, DeBBI transforms the problem of cross-project BBI detection into a traditional information retrieval (IR) problem to execute the client projects with higher probability to detect BBIs earlier. Furthermore, DeBBI considers project diversity and test relevance information for even faster BBI detection. The experimental results show that DeBBI can reduce the end-to-end testing time for detecting the first and average unique BBIs by 99.1% and 70.8% for JDK compared to naive cross-project BBI detection. Also, DeBBI has been applied to other popular 3rd-party libraries. To date, DeBBI has detected 97 BBI bugs with 19 already confirmed as previously unknown bugs.« less
  5. Testing scientific software is a difficult task due to their inherent complexity and the lack of test oracles. In addition, these software systems are usually developed by end-user developers who are not normally trained as professional software developers nor testers. These factors often lead to inadequate testing. Metamorphic testing (MT) is a simple yet effective testing technique for testing such applications. Even though MT is a well known technique in the software testing community, it is not very well utilized by the scientific software developers. The objective of this paper is to present MT as an effective technique for testingmore »scientific software. To this end, we discuss why MT is an appropriate testing technique for scientists and engineers who are not primarily trained as software developers. Specifically, how it can be used to conduct systematic and effective testing on programs that do not have test oracles without requiring additional testing tools.« less