

Title: The Off-Switch Game

One of the primary tools we can use to mitigate the potential risk from a misbehaving AI system is the ability to turn the system off. As the capabilities of AI systems improve, it is important to ensure that such systems do not adopt subgoals that prevent a human from switching the system off. This is a challenge because many formulations of rational agents create strong incentives for self-preservation. These incentives do not come from a built-in instinct; they arise because a rational agent maximizes expected utility and cannot achieve whatever objective it has been given if it is dead. Our goal is to study the incentives an agent has to allow itself to be switched off. We analyze a simple game between a human H and a robot R, where H can press R’s off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational. Our key insight is that for R to want to preserve its off switch, it needs to be uncertain about the utility associated with the outcome, and to treat H’s actions as important observations about that utility. (R also has no incentive to switch itself off in this setting.) We conclude that giving machines an appropriate level of uncertainty about their objectives leads to safer designs, and we argue that this setting is a useful generalization of the classical AI paradigm of rational agents.
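The incentive argument in the abstract can be illustrated with a small numerical sketch. The abstract does not spell out the formal model, so everything below is an assumption made for illustration: R holds a discrete belief over the utility of its proposed action, switching off is worth 0, and H makes the correct call (allow the action when its utility is non-negative, press the switch otherwise) with probability `p_correct`. The function name `off_switch_values` and its parameters are hypothetical.

```python
def off_switch_values(belief, p_correct=1.0):
    """Compare R's three options under an assumed, simplified model.

    belief: list of (utility, probability) pairs describing R's belief
            about the utility of its proposed action (hypothetical input).
    p_correct: assumed probability that H decides correctly, i.e. allows
               the action when utility >= 0 and switches R off otherwise.
    """
    # Option 1: disable the off switch and act. R receives the utility directly.
    act = sum(u * p for u, p in belief)

    # Option 2: R switches itself off. Utility is 0 by assumption.
    off = 0.0

    # Option 3: leave the switch intact and defer to H.
    # R receives u only in the cases where H allows the action.
    wait = 0.0
    for u, p in belief:
        p_allow = p_correct if u >= 0 else 1.0 - p_correct
        wait += p * p_allow * u

    return {"act": act, "off": off, "wait": wait}


if __name__ == "__main__":
    # R is uncertain whether its action helps (+2) or harms (-1).
    uncertain = [(-1.0, 0.5), (2.0, 0.5)]
    # R is certain its action is worth +2 (a "traditional" agent).
    certain = [(2.0, 1.0)]

    for name, belief in [("uncertain", uncertain), ("certain", certain)]:
        for p in (1.0, 0.9):
            print(name, "p_correct =", p, off_switch_values(belief, p))
```

Under these assumptions, the uncertain R does better by deferring to H than by acting unilaterally, because H's decision filters out the harmful case. The certain R is indifferent when H is perfectly rational and prefers to bypass H as soon as H can err, which mirrors the contrast drawn in the abstract.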

 
Award ID(s):
1734633
NSF-PAR ID:
10063830
Journal Name:
International Joint Conferences on Artificial Intelligence Organization
Page Range / eLocation ID:
220 to 227
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract  
  2. Systems engineering processes (SEPs) coordinate the effort of different individuals to generate a product satisfying certain requirements. As the involved engineers are self-interested agents, the goals at different levels of the systems engineering hierarchy may deviate from the system-level goals, which may cause budget and schedule overruns. Therefore, there is a need for a systems engineering theory that accounts for human behavior in systems design. As experience in the physical sciences shows, a lot of knowledge can be generated by studying simple hypothetical scenarios, which nevertheless retain some aspects of the original problem. To this end, the objective of this article is to study the simplest conceivable SEP, a principal-agent model of a one-shot, shallow SEP. We assume that the systems engineer (SE) maximizes the expected utility of the system, while the subsystem engineers (sSE) seek to maximize their expected utilities. Furthermore, the SE is unable to monitor the effort of the sSE and may not have complete information about their types. However, the SE can incentivize the sSE by proposing specific contracts. To obtain an optimal incentive, we pose and solve numerically a bilevel optimization problem. Through extensive simulations, we study the optimal incentives arising from different system-level value functions under various combinations of effort costs, problem-solving skills, and task complexities. Our numerical examples show that the requirements passed down to the agents increase as the task complexity and uncertainty grow, and decrease as the agents' costs increase.
  3. Engineering design involves information acquisition decisions such as selecting designs in the design space for testing, selecting information sources, and deciding when to stop design exploration. Existing literature has established normative models for these decisions, but there is a lack of knowledge about how human designers make these decisions and which strategies they use. This knowledge is important for accurately modeling design decisions, identifying sources of inefficiencies, and improving the design process. Therefore, the primary objective in this study is to identify models that provide the best description of a designer’s information acquisition decisions when multiple information sources are present and the total budget is limited. We conduct a controlled human subject experiment with two independent variables: the amount of fixed budget and the monetary incentive proportional to the saved budget. By using the experimental observations, we perform Bayesian model comparison on various simple heuristic models and expected utility (EU)-based models. As expected, the subjects’ decisions are better represented by the heuristic models than the EU-based models. While the EU-based models result in better net payoff, the heuristic models used by the subjects generate better design performance. The net payoff using heuristic models is closer to the EU-based models in experimental treatments where the budget is low and there is an incentive for saving the budget. This indicates the potential for nudging designers’ decisions toward maximizing the net payoff by setting the fixed budget at low values and providing monetary incentives proportional to the saved budget.
  4. We investigate how sequential decision making analysis can be used for modeling system resilience. In the aftermath of an extreme event, agents involved in emergency management aim at an optimal recovery process, trading off the loss due to lack of system functionality against the investment needed for a fast recovery. This process can be formulated as a sequential decision-making optimization problem, where the overall loss has to be minimized by adopting an appropriate policy, and dynamic programming applied to Markov Decision Processes (MDPs) provides a rational and computationally feasible framework for a quantitative analysis. The paper investigates how trends of post-event loss and recovery can be understood in light of the sequential decision making framework. Specifically, it is well known that a system’s functionality is often taken to a level different from that before the event: this can be the result of budget constraints and/or economic opportunity, and the framework has the potential of integrating these considerations. Here, however, we focus on the specific case of an agent learning something new about the process and reacting by updating the target functionality level of the system. We illustrate how this can happen in a simplified setting, by using Hidden-Model MDPs (HM-MDPs) for modelling the management of a set of components under model uncertainty. When an extreme event occurs, the agent updates the hazard model and, consequently, her response and long-term planning.
  5. Explanations of AI agents' actions are considered to be an important factor in improving users' trust in the decisions made by autonomous AI systems. However, as these autonomous systems evolve from reactive, i.e., acting on user input, to proactive, i.e., acting without requiring user intervention, there is a need to explore how the explanation for the actions of these agents should evolve. In this work, we explore the design of explanations through participatory design methods for a proactive auto-response messaging agent that can reduce perceived obligations and social pressure to respond quickly to incoming messages by providing unavailability-related context. We recruited 14 participants who worked in pairs during collaborative design sessions where they reasoned about the agent's design and actions. We qualitatively analyzed the data collected through these sessions and found that participants' reasoning about agent actions led them to speculate heavily on its design. These speculations significantly influenced participants' desire for explanations and the controls they sought to inform the agents' behavior. Our findings indicate a need to transform users' speculations into accurate mental models of agent design. Further, since the agent acts as a mediator in human-human communication, it is also necessary to account for social norms in its explanation design. Finally, users' expertise in understanding their own habits and behaviors allows the agent to learn the users' preferences from them when justifying its actions.

     