skip to main content

Title: Themis: Automatically testing software for discrimination
Bias in decisions made by modern software is becoming a common and serious problem. We present Themis, an automated test suite generator to measure two types of discrimination, including causal relationships between sensitive inputs and program behavior. We explain how Themis can measure discrimination and aid its debugging, describe a set of optimizations Themis uses to reduce test suite size, and demonstrate Themis' effectiveness on open-source software. Themis is open-source and all our evaluation data are available at See a video of Themis in action:
; ; ;
Award ID(s):
1744471 1763423 1453474 1453543
Publication Date:
Journal Name:
Proceedings of the Demonstrations Track at the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)
Page Range or eLocation-ID:
871 to 875
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper defines the notions of software fairness and discrimination and develops a testing-based method for measuring if and how much software discriminates. Specifically, the paper focuses on measuring causality in discriminatory behavior. Modern software contributes to important societal decisions and evidence of software discrimination has been found in systems that recommend criminal sentences, grant access to financial loans and products, and determine who is allowed to participate in promotions and receive services. Our approach, Themis, measures discrimination in software by generating efficient, discrimination-testing test suites. Given a schema describing valid system inputs, Themis generates discrimination tests automatically and, notably,more »does not require an oracle. We evaluate Themis on 20 software systems, 12 of which come from prior work with explicit focus on avoiding discrimination. We find that (1) Themis is effective at discovering software discrimination, (2) state-of-the-art techniques for removing discrimination from algorithms fail in many situations, at times discriminating against as much as 98% of an input subdomain, (3) Themis optimizations are effective at producing efficient test suites for measuring discrimination, and (4) Themis is more efficient on systems that exhibit more discrimination. We thus demonstrate that fairness testing is a critical aspect of the software development cycle in domains with possible discrimination and provide initial tools for measuring software discrimination.« less
  2. Mutation testing is widely used in research as a metric for evaluating the quality of test suites. Mutation testing runs the test suite on generated mutants (variants of the code under test), where a test suite kills a mutant if any of the tests fail when run on the mutant. Mutation testing implicitly assumes that tests exhibit deterministic behavior, in terms of their coverage and the outcome of a test (not) killing a certain mutant. Such an assumption does not hold in the presence of flaky tests, whose outcomes can non-deterministically differ even when run on the same code undermore »test. Without reliable test outcomes, mutation testing can result in unreliable results, e.g., in our experiments, mutation scores vary by four percentage points on average between repeated executions, and 9% of mutant-test pairs have an unknown status. Many modern software projects suffer from flaky tests. We propose techniques that manage flakiness throughout the mutation testing process, largely based on strategically re-running tests. We implement our techniques by modifying the open-source mutation testing tool, PIT. Our evaluation on 30 projects shows that our techniques reduce the number of "unknown" (flaky) mutants by 79.4%.« less
  3. We present an implementation of the trimmed serendipity finite element family, using the open-source finite element package Firedrake. The new elements can be used seamlessly within the software suite for problems requiring H 1 , H (curl), or H (div)-conforming elements on meshes of squares or cubes. To test how well trimmed serendipity elements perform in comparison to traditional tensor product elements, we perform a sequence of numerical experiments including the primal Poisson, mixed Poisson, and Maxwell cavity eigenvalue problems. Overall, we find that the trimmed serendipity elements converge, as expected, at the same rate as the respective tensor productmore »elements, while being able to offer significant savings in the time or memory required to solve certain problems.« less
  4. Infant behavior, like all behavior, is the aggregate product of many nested processes operating and interacting over multiple time scales; the result of a tangle of inter-related causes and effects. Efforts in identifying the mechanisms supporting infant behavior require the development and advancement of new technologies that can accurately and densely capture behavior’s multiple branches. The present study describes an open-source, wireless autonomic vest specifically designed for use in infants 8–24 months of age in order to measure cardiac activity, respiration, and movement. The schematics of the vest, instructions for its construction, and a suite of software designed for itsmore »use are made freely available. While the use of such autonomic measures has many applications across the field of developmental psychology, the present article will present evidence for the validity of the vest in three ways: (1) by demonstrating known clinical landmarks of a heartbeat, (2) by demonstrating an infant in a period of sustained attention, a well-documented behavior in the developmental psychology literature, and (3) relating changes in accelerometer output to infant behavior.« less
  5. Recent scientific computing increasingly relies on multi-scale multi-physics simulations to enhance predictive capabilities by replacing a suite of stand-alone simulation codes that independently simulate various physical phenomena. Inevitably, multi-physics simulation demands high performance computing (HPC) through advanced hardware and software accelerating due to its intensive computing workload and run-time communication needs. Thus, its research has become a hotspot across different disciplines. However, it is observed that most benchmarks used in the evaluation of corresponding work are through commercial or in-house codes. Then, the lack of accessible open-source multi-physics benchmark suites has presented a challenge in uniformly evaluating simulation performance acrossmore »related disciplines. This work proposes the first open-source based benchmark suite with 12 selected benchmarks for research in multi-physics simulation, the Clarkson Open-Source Multi-physics Benchmark Suite (COMBS). Multiple metrics have been gathered for these benchmarks, such as instructions per second and memory usage. Also provided are build and benchmark scripts to improve usability. Additionally, their source codes and installation guides are available for downloading through a github repository built by the authors. The selected benchmarks are from key applications of multi-physics simulation and highly cited publications. It is believed that this benchmark suite will facilitate to harness the full potential of HPC research in the field of multi-physics simulation.« less