
Title: The Off-Switch Game
Abstract:
One of the primary tools we have for mitigating the potential risk from a misbehaving AI system is the ability to turn the system off. As the capabilities of AI systems improve, it is important to ensure that such systems do not adopt subgoals that prevent a human from switching them off. This is a challenge because many formulations of rational agency create strong incentives for self-preservation: not because of a built-in instinct, but because a rational agent maximizes expected utility and cannot achieve whatever objective it has been given if it is dead. Our goal is to study the incentives an agent has to allow itself to be switched off. We analyze a simple game between a human H and a robot R, where H can press R's off switch but R can disable the off switch. A traditional agent takes its reward function for granted: we show that such agents have an incentive to disable the off switch, except in the special case where H is perfectly rational. Our key insight is that for R to want to preserve its off switch, it needs to be uncertain about the utility associated with the outcome, and to treat H's actions as important observations about that utility. (R also has no incentive to switch itself off in this setting.) We conclude that giving machines an appropriate level of uncertainty about their objectives leads to safer designs, and we argue that this setting is a useful generalization of the classical AI paradigm of rational agents.
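To make the incentive concrete, here is a minimal Python sketch of the expected-utility comparison described above. It assumes a perfectly rational H and an illustrative Gaussian belief over the unknown utility U; neither assumption is taken from the paper itself.

    # Minimal sketch of the off-switch incentive, assuming a rational human H
    # and a robot R that is uncertain about the utility U of its proposed
    # action a. The Gaussian belief below is an illustrative assumption, not
    # the paper's model.
    import numpy as np

    rng = np.random.default_rng(0)
    U = rng.normal(loc=0.0, scale=1.0, size=100_000)  # samples from R's belief over U

    # R's three options:
    #   act   -- disable the off switch and take action a: payoff U
    #   off   -- switch itself off: payoff 0
    #   defer -- propose a and wait; a rational H permits a only when U > 0
    eu_act = U.mean()                      # E[U]
    eu_off = 0.0
    eu_defer = np.maximum(U, 0.0).mean()   # E[max(U, 0)]

    print(f"E[act]   = {eu_act:+.3f}")
    print(f"E[off]   = {eu_off:+.3f}")
    print(f"E[defer] = {eu_defer:+.3f}")
    # Since E[max(U, 0)] >= max(E[U], 0), deferring to H is never worse than
    # acting unilaterally or shutting down, so an R that stays uncertain about
    # U and treats H's choice as evidence about U has no incentive to disable
    # its off switch.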
Award ID(s):
1734633
PAR ID:
10063830
Author(s) / Creator(s):
Dylan Hadfield-Menell; Anca Dragan; Pieter Abbeel; Stuart Russell
Date Published:
2017
Journal Name:
International Joint Conferences on Artificial Intelligence Organization
Page Range / eLocation ID:
220 to 227
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In recent years, federated learning has been embraced as an approach for bringing about collaboration across large populations of learning agents. However, little is known about how collaboration protocols should take agents' incentives into account when allocating individual resources for communal learning in order to maintain such collaborations. Inspired by game-theoretic notions, this paper introduces a framework for incentive-aware learning and data sharing in federated learning. Our stable and envy-free equilibria capture notions of collaboration in the presence of agents interested in meeting their learning objectives while keeping their own sample-collection burden low. For example, in an envy-free equilibrium no agent would wish to swap its sampling burden with any other agent, and in a stable equilibrium no agent would wish to unilaterally reduce its sampling burden. In addition to formalizing this framework, our contributions include characterizing the structural properties of such equilibria, proving when they exist, and showing how they can be computed. Furthermore, we compare the sample complexity of incentive-aware collaboration with that of optimal collaboration when one ignores agents' incentives. (A minimal code check of the two equilibrium conditions is sketched after this list.)
  2. Speakers build rapport by aligning their conversational behaviors with each other. Rapport engendered with a teachable agent while teaching it domain material has been shown to promote learning. Past work on lexical alignment in the field of education suffers from limitations in both the measures used to quantify alignment and the types of interactions in which alignment with agents has been studied. In this paper, we apply alignment measures based on a data-driven notion of shared expressions (possibly composed of multiple words) and compare alignment in one-on-one human-robot (H-R) interactions with the H-R portions of collaborative human-human-robot (H-H-R) interactions. We find that students in the H-R setting align with a teachable robot more than in the H-H-R setting, and that the relationship between lexical alignment and rapport is more complex than what is predicted by previous theoretical and empirical work.
  3. We consider an InterDependent Security (IDS) game with networked agents and positive externalities, in which each agent chooses an effort/investment level for securing itself. The agents are interdependent in that the security of one agent depends not only on its own investment but also on the other agents' investments. Because of the positive externality, agents under-invest in security, which leads to an inefficient Nash equilibrium (NE). While the under-investment issue has been analyzed extensively in the literature, in this study we take a different angle: we consider allowing agents to pool their resources, i.e., to invest both in themselves and in other agents. We show that the interaction of strategic, selfish agents under resource pooling (RP) improves the agents' effort/investment levels as well as their utility compared to a scenario without resource pooling. We further show that the social welfare (total utility) at the NE of the game with resource pooling is higher than the maximum social welfare attainable in a game without resource pooling, even when the latter uses an optimal incentive mechanism. Furthermore, while voluntary participation is not generally guaranteed in that latter scenario, it is guaranteed under resource pooling.
  4. Chenyang Lu (Ed.)
    The design and analysis of multi-agent human cyber-physical systems in safety-critical or industry-critical domains calls for an adequate semantic foundation capable of exhaustively and rigorously describing all emergent effects in the joint dynamic behavior of the agents that are relevant to their safety and well-behavior. We present such a semantic foundation. The framework goes beyond previous approaches: in addition to the state components under an agent's direct control and its beliefs about other agents (as previously suggested for understanding cooperative and rational behavior), the agent-local dynamic state also carries agent-local evidence and beliefs about the overall cooperative, competitive, or coopetitive game structure. We argue that this extension is necessary for rigorously analyzing systems of human cyber-physical systems, because humans are known to employ cognitive replacement models of system dynamics that are both non-stationary and potentially incongruent. These replacement models induce visible and potentially harmful effects on the joint emergent behavior and on the interaction with cyber-physical system components.
  5. We propose a pseudo-market mechanism for no-monetary-transfer allocation of indivisible objects based on priorities, such as those in school choice. Agents are given token money, face priority-specific prices, and buy utility-maximizing random assignments. The mechanism is asymptotically incentive compatible, and the resulting assignments are fair and constrained Pareto efficient. Hylland and Zeckhauser's (1979) position-allocation problem is a special case of our framework, and our results on incentives and fairness are also new in their classical setting. (JEL D63, D82, H75, I21, I28)
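Following on item 1 above, the sketch below spells out the two equilibrium conditions quoted there (no agent wants to swap its sampling burden, and no agent wants to unilaterally reduce it). The utility model and allocation used here are hypothetical placeholders, not the paper's actual framework.

    # Illustrative check of the envy-free and stable conditions described in
    # item 1 above. The utility model (shared benefit from the pooled samples
    # minus each agent's own collection cost) is a hypothetical stand-in.
    from typing import Callable, Sequence

    def is_envy_free(burdens: Sequence[float],
                     utility: Callable[[int, Sequence[float]], float]) -> bool:
        """No agent prefers swapping its sampling burden with another agent."""
        n = len(burdens)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                swapped = list(burdens)
                swapped[i], swapped[j] = swapped[j], swapped[i]
                if utility(i, swapped) > utility(i, burdens):
                    return False
        return True

    def is_stable(burdens: Sequence[float],
                  utility: Callable[[int, Sequence[float]], float],
                  step: float = 0.1) -> bool:
        """No agent gains by unilaterally reducing its own sampling burden."""
        for i, b in enumerate(burdens):
            reduced = list(burdens)
            reduced[i] = max(0.0, b - step)
            if utility(i, reduced) > utility(i, burdens):
                return False
        return True

    # Hypothetical utility: benefit from the total pooled samples (capped at a
    # learning target of 10) minus the agent's own collection cost.
    def utility(i: int, burdens: Sequence[float]) -> float:
        return min(sum(burdens), 10.0) - 0.5 * burdens[i]

    # Under this toy utility, an equal split of the target is both envy-free
    # and stable.
    allocation = [10 / 3, 10 / 3, 10 / 3]
    print(is_envy_free(allocation, utility), is_stable(allocation, utility))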