Title: A Conceptual Framework for Ethical Evaluation of Machine Learning Systems
Research in Responsible AI has developed a range of principles and practices to ensure that machine learning systems are used in a manner that is ethical and aligned with human values. However, a critical yet often neglected aspect of ethical ML is the set of ethical implications that arise when designing evaluations of ML systems. For instance, teams may have to balance a trade-off between highly informative tests that ensure downstream product safety and potential fairness harms inherent to the testing procedures themselves. We conceptualize ethics-related concerns in standard ML evaluation techniques. Specifically, we present a utility framework that characterizes the key trade-off in ethical evaluation as balancing information gain against potential ethical harms. The framework then serves as a tool for characterizing the challenges teams face and for systematically disentangling the competing considerations they seek to balance. Differentiating among the types of issues encountered in evaluation allows us to highlight best practices from analogous domains, such as clinical trials and automotive crash testing, which navigate these issues in ways that can inspire improvements to ML evaluation processes. Our analysis underscores the critical need for development teams to deliberately assess and manage the ethical complexities that arise during the evaluation of ML systems, and for the industry to move toward institutional policies that support ethical evaluations.
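One compact way to read the framework's central trade-off, in illustrative notation of our own rather than the paper's, is as an expected-utility comparison over candidate evaluation designs:

% Illustrative sketch only: e ranges over candidate evaluation designs,
% I(e) is the expected information gained about downstream system safety,
% H(e) is the expected ethical harm imposed by running the evaluation,
% and \lambda weights harm against information.
U(e) = I(e) - \lambda \, H(e), \qquad e^{\star} \in \arg\max_{e \in \mathcal{E}} U(e)

Under this reading, an evaluation is worth running only when its information value outweighs the weighted harms it would impose.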
Award ID(s):
2229873
PAR ID:
10573583
Author(s) / Creator(s):
Publisher / Repository:
AAAI/ACM
Date Published:
Journal Name:
Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society
Volume:
7
ISSN:
3065-8365
Page Range / eLocation ID:
534 to 546
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Many organizations seek to ensure that machine learning (ML) and artificial intelligence (AI) systems work as intended in production but currently do not have a cohesive methodology in place to do so. To fill this gap, we propose MLTE (Machine Learning Test and Evaluation, colloquially referred to as "melt"), a framework and implementation to evaluate ML models and systems. The framework compiles state-of-the-art evaluation techniques into an organizational process for interdisciplinary teams, including model developers, software engineers, system owners, and other stakeholders. MLTE tooling supports this process by providing a domain-specific language that teams can use to express model requirements, an infrastructure to define, generate, and collect ML evaluation metrics, and the means to communicate results. 
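As a rough illustration of the kind of requirements-driven check such a process implies (the identifiers below are hypothetical and do not reflect the actual MLTE API), a team might state requirements and compare collected measurements against them:

# Hypothetical sketch of a requirements-driven evaluation step; names are
# illustrative and are not taken from the real MLTE package.
from dataclasses import dataclass

@dataclass
class Requirement:
    metric: str        # name of a collected evaluation metric
    threshold: float   # minimum acceptable value
    owner: str         # stakeholder responsible for the requirement

def check(requirements, measurements):
    """Compare collected measurements against stated requirements."""
    return {
        r.metric: measurements.get(r.metric, float("-inf")) >= r.threshold
        for r in requirements
    }

requirements = [
    Requirement("accuracy", 0.90, "model developer"),
    Requirement("demographic_parity_gap_ok", 1.0, "system owner"),
]
measurements = {"accuracy": 0.93, "demographic_parity_gap_ok": 1.0}
print(check(requirements, measurements))
# {'accuracy': True, 'demographic_parity_gap_ok': True}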
  2. Societal Impact Statement: It is increasingly common for plant scientists and urban planning and design professionals to collaborate on interdisciplinary teams that integrate scientific experiments into public and social urban spaces. However, neither the procedural ethics that govern scientific experimentation, nor the professional ethics of urban design and planning practice, fully account for the possible impacts of urban ecological experiments on local residents and communities. Scientists that participate in design and planning teams act as decision‐makers, and must expand their domain of ethical consideration accordingly. Conversely, practitioners who engage in ecological experiments take on the moral responsibilities inherent in generation of knowledge. To avoid potential harm to human and non‐human inhabitants of cities while maintaining scientific and professional integrity in research and practice, an integrated ethical framework is needed for urban ecological planning and design. Summary: While there are many ethical and procedural guidelines for scientists who wish to inform decision‐making and public policy, urban ecologists are increasingly embedded in planning and design teams to integrate scientific measurements and experiments into urban landscapes. These scientists are not just informing decision‐making – they are themselves acting as decision‐makers. As such, researchers take on additional moral obligations beyond scientific procedural ethics when designing and conducting ecological design and planning experiments. We describe the growing field of urban ecological design and planning and present a framework for expanding the ethical considerations of socioecological researchers and urban practitioners who collaborate on interdisciplinary teams. Drawing on existing ethical frameworks from a range of disciplines, we outline possible ways in which ecologists, social scientists, and practitioners should expand the traditional ethical considerations of their work to ensure that urban residents, communities, and non‐human entities are not harmed as researchers and practitioners carry out their individual obligations to clients, municipalities, and scientific practice. We present an integrated framework to aid in the development of ethical codes for research, practice, and education in integrated urban ecology, socioenvironmental sciences, and design and planning.
  3. Structured data-quality issues—such as missing values correlated with demographics, culturally biased labels, or systemic selection biases—routinely degrade the reliability of machine-learning pipelines. Regulators now increasingly demand evidence that high-stakes systems can withstand these realistic, interdependent errors, yet current robustness evaluations typically use random or overly simplistic corruptions, leaving worst-case scenarios unexplored. We introduce Savage, a causally inspired framework that (i) formally models realistic data-quality issues through dependency graphs and flexible corruption templates, and (ii) systematically discovers corruption patterns that maximally degrade a target performance metric. Savage employs a bi-level optimization approach to efficiently identify vulnerable data subpopulations and fine-tune corruption severity, treating the full ML pipeline, including preprocessing and potentially non-differentiable models, as a black box. Extensive experiments across multiple datasets and ML tasks (data cleaning, fairness-aware learning, uncertainty quantification) demonstrate that even a small fraction (around 5%) of structured corruptions identified by Savage severely impacts model performance, far exceeding random or manually crafted errors, and invalidating core assumptions of existing techniques. Thus, Savage provides a practical tool for rigorous pipeline stress-testing, a benchmark for evaluating robustness methods, and actionable guidance for designing more resilient data workflows. 
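A stripped-down sketch of the idea (illustrative only; the function and variable names below are ours, and Savage's corruption templates and bi-level optimization are more general) is to inject structured missingness into a targeted subpopulation and search for the corruption severity that most degrades a black-box pipeline's score:

# Illustrative sketch: structured missingness targeted at one subpopulation,
# with a simple severity search standing in for Savage's bi-level optimization.
import numpy as np

def corrupt(X, group_mask, feature, severity, rng):
    """Set `feature` to NaN for a `severity` fraction of rows in the group."""
    Xc = X.copy()
    rows = np.flatnonzero(group_mask)
    hit = rng.choice(rows, size=int(severity * len(rows)), replace=False)
    Xc[hit, feature] = np.nan
    return Xc

def worst_case(X, y, group_mask, feature, pipeline_score, severities, seed=0):
    """Treat the full pipeline as a black box; return the most damaging severity."""
    rng = np.random.default_rng(seed)
    scored = [(s, pipeline_score(corrupt(X, group_mask, feature, s, rng), y))
              for s in severities]
    return min(scored, key=lambda pair: pair[1])  # severity with the lowest score

Here pipeline_score would wrap whatever preprocessing and model the team actually deploys, so nothing inside it needs to be differentiable.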
  4. Despite the availability of numerous automatic accessibility testing solutions, web accessibility issues persist on many websites. Moreover, there is a lack of systematic evaluations of the efficacy of current accessibility testing tools. To address this gap, we present the first mutation analysis framework, called Ma11y, designed to assess web accessibility testing tools. Ma11y includes 25 mutation operators that intentionally violate various accessibility principles and an automated oracle to determine whether a mutant is detected by a testing tool. Evaluation on real-world websites demonstrates the practical applicability of the mutation operators and the framework's capacity to assess tool performance. Our results show that current tools fail to detect nearly 50% of the accessibility bugs injected by our framework, underscoring the need for more effective accessibility testing tools. Finally, the framework's accuracy and performance attest to its potential for seamless and automated application in practical settings.
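For a concrete sense of what such a mutation operator does (a hypothetical example in the spirit of the framework, not one of Ma11y's 25 operators verbatim), one might strip alt text from images and then ask whether the tool under test reports the injected violation:

# Hypothetical accessibility mutation operator: drop alt attributes from <img>
# tags, violating the text-alternative principle, then check a tool's report.
import re

def mutate_drop_alt(html: str) -> str:
    """Remove alt="..." attributes from img tags in the given HTML."""
    return re.sub(r'(<img\b[^>]*?)\s+alt="[^"]*"', r'\1', html, flags=re.IGNORECASE)

def oracle(tool_report: list[str]) -> bool:
    """The mutant counts as detected if the tool flags a missing text alternative."""
    return any("alt" in issue.lower() for issue in tool_report)

page = '<img src="logo.png" alt="Company logo">'
mutant = mutate_drop_alt(page)                        # '<img src="logo.png">'
print(oracle(["Image is missing an alt attribute"]))  # True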
  5. Peer evaluations are critical for assessing teams, but are susceptible to bias and other factors that undermine their reliability. At the same time, collaborative tools that teams commonly use to perform their work are increasingly capable of logging activity that can signal useful information about individual contributions and teamwork. To investigate current and potential uses for activity traces in peer evaluation tools, we interviewed (N=11) and surveyed (N=242) students and interviewed (N=10) instructors at a single university. We found that nearly all of the students surveyed considered specific contributions to the team outcomes when evaluating their teammates, but also reported relying on memory and subjective experiences to make the assessment. Instructors desired objective sources of data to address challenges with administering and interpreting peer evaluations, and have already begun incorporating activity traces from collaborative tools into their evaluations of teams. However, both students and instructors expressed concern about using activity traces due to the diverse ecosystem of tools and platforms used by teams and the limited view into the context of the contributions. Based on our findings, we contribute recommendations and a speculative design for a data-centric peer evaluation tool. 