

This content will become publicly available on April 11, 2026

Title: Evaluation and Incident Prevention in an Enterprise AI Assistant
Enterprise AI Assistants are increasingly deployed in domains where accuracy is paramount, making each erroneous output a potentially significant incident. This paper presents a comprehensive framework for monitoring, benchmarking, and continuously improving such complex, multi-component systems under active development by multiple teams. Our approach encompasses three key elements: (1) a hierarchical "severity" framework for incident detection that identifies and categorizes errors while attributing component-specific error rates, facilitating targeted improvements; (2) a scalable and principled methodology for benchmark construction, evaluation, and deployment, designed to accommodate multiple development teams, mitigate overfitting risks, and assess the downstream impact of system modifications; and (3) a continual improvement strategy leveraging multidimensional evaluation, enabling the identification and implementation of diverse enhancement opportunities. By adopting this holistic framework, organizations can systematically enhance the reliability and performance of their AI Assistants, ensuring their efficacy in critical enterprise environments. We conclude by discussing how this multifaceted evaluation approach opens avenues for various classes of enhancements, paving the way for more robust and trustworthy AI systems.
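The full text is embargoed until April 2026, but the severity framework the abstract outlines lends itself to a minimal sketch. The code below is a hypothetical illustration only: the severity levels, component names, and `component_error_rates` helper are assumptions, not the paper's actual taxonomy or implementation.

```python
# Minimal sketch of a hierarchical severity taxonomy with per-component
# error-rate attribution. All names (Severity, Incident, traffic keys)
# are illustrative assumptions, not the paper's actual framework.
from collections import Counter
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Ordered severity levels; higher values are worse incidents."""
    COSMETIC = 1    # formatting glitches, tone issues
    DEGRADED = 2    # partially correct or unhelpful answers
    INCORRECT = 3   # factually wrong output
    CRITICAL = 4    # wrong output acted upon in a high-stakes flow


@dataclass
class Incident:
    component: str       # pipeline stage blamed for the error, e.g. "retriever"
    severity: Severity


def component_error_rates(incidents: list[Incident],
                          traffic: dict[str, int]) -> dict[str, float]:
    """Attribute incidents to components and normalize by request volume."""
    counts = Counter(i.component for i in incidents)
    return {c: counts.get(c, 0) / n for c, n in traffic.items() if n > 0}


incidents = [Incident("retriever", Severity.INCORRECT),
             Incident("generator", Severity.COSMETIC),
             Incident("retriever", Severity.CRITICAL)]
traffic = {"retriever": 1000, "generator": 1000, "router": 1000}
print(component_error_rates(incidents, traffic))
# {'retriever': 0.002, 'generator': 0.001, 'router': 0.0}
```

Attributing rates per component, rather than tracking a single system-wide error rate, is what makes the targeted improvements the abstract mentions possible: the team owning the worst-offending component gets an actionable signal.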
Award ID(s):
2243822
PAR ID:
10646735
Author(s) / Creator(s):
; ; ; ; ; ; ; ;
Publisher / Repository:
IAAI
Date Published:
Journal Name:
Proceedings of the AAAI Conference on Artificial Intelligence
Volume:
39
Issue:
28
ISSN:
2159-5399
Page Range / eLocation ID:
28931 to 28937
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The need for interpretable and accountable intelligent systems grows along with the prevalence of artificial intelligence (AI) applications used in everyday life. Explainable AI (XAI) systems are intended to self-explain the reasoning behind system decisions and predictions. Researchers from different disciplines work together to define, design, and evaluate explainable systems. However, scholars from different disciplines focus on different objectives and fairly independent topics of XAI research, which poses challenges for identifying appropriate design and evaluation methodology and consolidating knowledge across efforts. To this end, this article presents a survey and framework intended to share knowledge and experiences of XAI design and evaluation methods across multiple disciplines. Aiming to support diverse design goals and evaluation methods in XAI research, after a thorough review of XAI-related papers in the fields of machine learning, visualization, and human-computer interaction, we present a categorization of XAI design goals and evaluation methods. Our categorization presents the mapping between design goals for different XAI user groups and their evaluation methods. From our findings, we develop a framework with step-by-step design guidelines paired with evaluation methods to close the iterative design and evaluation cycles in multidisciplinary XAI teams. Further, we provide summarized ready-to-use tables of evaluation methods and recommendations for different goals in XAI research.
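To make the goal-to-method mapping concrete, here is a toy sketch of the lookup structure such a categorization implies; the specific user groups, goals, and methods below are invented placeholders, not the survey's actual tables.

```python
# Toy sketch of a (user group, design goal) -> evaluation methods lookup,
# in the spirit of the survey's categorization. All pairings are invented
# placeholders for illustration only.
EVALUATION_METHODS = {
    ("ML experts", "model debugging"): ["ablation studies", "fidelity metrics"],
    ("domain experts", "trust calibration"): ["trust questionnaires",
                                              "decision-accuracy studies"],
    ("lay users", "understandability"): ["comprehension quizzes",
                                         "think-aloud interviews"],
}


def suggest_methods(user_group: str, design_goal: str) -> list[str]:
    """Return candidate evaluation methods for a design goal, if mapped."""
    return EVALUATION_METHODS.get((user_group, design_goal), [])


print(suggest_methods("lay users", "understandability"))
```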
  2. The rapid adoption of generative AI in software development has impacted the industry, yet its effects on developers with visual impairments remain largely unexplored. To address this gap, we used an Activity Theory framework to examine how developers with visual impairments interact with AI coding assistants. For this purpose, we conducted a study where developers who are visually impaired completed a series of programming tasks using a generative AI coding assistant. We uncovered that, while participants found the AI assistant beneficial and reported significant advantages, they also highlighted accessibility challenges. Specifically, the AI coding assistant often exacerbated existing accessibility barriers and introduced new challenges. For example, it overwhelmed users with an excessive number of suggestions, leading developers who are visually impaired to express a desire for “AI timeouts.” Additionally, the generative AI coding assistant made it more difficult for developers to switch contexts between the AI-generated content and their own code. Despite these challenges, participants were optimistic about the potential of AI coding assistants to transform the coding experience for developers with visual impairments. Our findings emphasize the need to apply activity-centered design principles to generative AI assistants, ensuring they better align with user behaviors and address specific accessibility needs. This approach can enable the assistants to provide more intuitive, inclusive, and effective experiences, while also contributing to the broader goal of enhancing accessibility in software development.
  3. Research in Responsible AI has developed a range of principles and practices to ensure that machine learning systems are used in a manner that is ethical and aligned with human values. However, a critical yet often neglected aspect of ethical ML is the set of ethical implications that arise when designing evaluations of ML systems. For instance, teams may have to balance a trade-off between highly informative tests that ensure downstream product safety and the potential fairness harms inherent to the testing procedures themselves. We conceptualize ethics-related concerns in standard ML evaluation techniques. Specifically, we present a utility framework, characterizing the key trade-off in ethical evaluation as balancing information gain against potential ethical harms. The framework then serves as a tool for characterizing the challenges teams face and systematically disentangling the competing considerations they seek to balance. Differentiating between the types of issues encountered in evaluation allows us to highlight best practices from analogous domains, such as clinical trials and automotive crash testing, which navigate these issues in ways that can offer inspiration for improving evaluation processes in ML. Our analysis underscores the critical need for development teams to deliberately assess and manage the ethical complexities that arise during the evaluation of ML systems, and for the industry to move towards designing institutional policies to support ethical evaluations.
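The information-gain-versus-harm trade-off admits a simple utility sketch. The linear form and the weight `lam` below are illustrative assumptions, not the paper's formalization.

```python
# Sketch of the trade-off the abstract describes: the value of a test is
# its information gain discounted by the ethical harm it risks. The linear
# form, the weight lam, and the candidate tests are assumptions.
def test_utility(info_gain: float, expected_harm: float,
                 lam: float = 1.0) -> float:
    """Utility of running an evaluation: information minus weighted harm."""
    return info_gain - lam * expected_harm


# Prefer the candidate test with the best information/harm balance.
candidates = {"synthetic red-team suite": (0.8, 0.1),
              "live user A/B test": (0.95, 0.6)}
best = max(candidates, key=lambda t: test_utility(*candidates[t]))
print(best)  # 'synthetic red-team suite' under the default weight
```

Raising `lam` encodes a stricter ethical stance: as harms are weighted more heavily, less informative but safer tests win out, which is exactly the kind of competing consideration the framework aims to make explicit.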
  4. Many organizations seek to ensure that machine learning (ML) and artificial intelligence (AI) systems work as intended in production but currently do not have a cohesive methodology in place to do so. To fill this gap, we propose MLTE (Machine Learning Test and Evaluation, colloquially referred to as "melt"), a framework and implementation to evaluate ML models and systems. The framework compiles state-of-the-art evaluation techniques into an organizational process for interdisciplinary teams, including model developers, software engineers, system owners, and other stakeholders. MLTE tooling supports this process by providing a domain-specific language that teams can use to express model requirements, an infrastructure to define, generate, and collect ML evaluation metrics, and the means to communicate results. 
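MLTE itself ships real tooling; the snippet below is only a hypothetical sketch of what expressing and checking a model requirement in code might look like, and does not use the actual mlte package API.

```python
# Hypothetical sketch of expressing model requirements as code, in the
# spirit of the MLTE process; this is NOT the actual mlte package API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Requirement:
    name: str
    metric: str
    check: Callable[[float], bool]   # predicate the measured value must pass


requirements = [
    Requirement("adequate accuracy", "test_accuracy", lambda v: v >= 0.90),
    Requirement("bounded latency", "p95_latency_ms", lambda v: v <= 200),
]

# Metric values would come from the evaluation infrastructure;
# these numbers are invented for illustration.
measured = {"test_accuracy": 0.93, "p95_latency_ms": 240}

for r in requirements:
    status = "PASS" if r.check(measured[r.metric]) else "FAIL"
    print(f"{r.name}: {status}")   # communicate results to stakeholders
```

Writing requirements as executable checks is what lets the interdisciplinary teams the abstract mentions (model developers, software engineers, system owners) agree on criteria up front and re-verify them on every model revision.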
  5. Miscommunication between instructors and students is a significant obstacle to post-secondary learning. Students may skip office hours due to insecurities or scheduling conflicts, which can lead to missed opportunities for questions. To support self-paced learning and encourage creative thinking skills, academic institutions must redefine their approach to education by offering flexible educational pathways that recognize continuous learning. To this end, we developed an AI-augmented intelligent educational assistance framework based on a powerful language model (i.e., GPT-3) that automatically generates course-specific intelligent assistants regardless of discipline or academic level. The virtual intelligent teaching assistant (TA) system, which is at the core of our framework, serves as a voice-enabled helper capable of answering a wide range of course-specific questions, from curriculum to logistics and course policies. By providing students with easy access to this information, the virtual TA can help to improve engagement and reduce barriers to learning. At the same time, it can also help to reduce the logistical workload for instructors and TAs, freeing up their time to focus on other aspects of teaching and supporting students. Its GPT-3-based knowledge discovery component and the generalized system architecture are presented, accompanied by a methodical evaluation of the system’s accuracy and performance.
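As a rough sketch of the pattern described (retrieve course-specific context, then query a language model), consider the following; `llm_complete`, the course facts, and the keyword retrieval are placeholders, not the paper's GPT-3 pipeline.

```python
# Minimal sketch of the virtual-TA pattern: look up course-specific
# context, then ask a language model. llm_complete is a placeholder for
# whatever completion API is used (GPT-3 in the paper); the course
# snippets below are invented examples.
COURSE_FACTS = {
    "office hours": "Office hours are Tuesdays 2-4pm in Room 301.",
    "late policy": "Late submissions lose 10% per day, up to three days.",
}


def retrieve_context(question: str) -> str:
    """Naive keyword retrieval over course documents."""
    q = question.lower()
    return " ".join(fact for key, fact in COURSE_FACTS.items() if key in q)


def answer(question: str, llm_complete) -> str:
    """Ground the model's answer in retrieved course information."""
    prompt = (f"Course information: {retrieve_context(question)}\n"
              f"Student question: {question}\nAnswer:")
    return llm_complete(prompt)


# Example with a stub model; swap in a real completion client.
print(answer("What is the late policy?", lambda p: p.splitlines()[0]))
```

Grounding each prompt in course documents is what keeps such an assistant course-specific: the same framework can be pointed at a different syllabus to generate a new TA, which is the generality the abstract claims.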