 NSFPAR ID:
 10387293
 Editor(s):
 Pennock, David M.; Segal, Ilya; Seuken, Sven
 Publisher / Repository:
 ACM
 Date Published:
 Journal Name:
 Proceedings of the 23rd ACM Conference on Economics and Computation
 Page Range / eLocation ID:
 1121 to 1140
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

We pose and study the problem of planning in Markov decision processes (MDPs), subject to participation constraints as studied in mechanism design. In this problem, a planner must work with a selfinterested agent on a given MDP. Each action in the MDP provides an immediate reward to the planner and a (possibly different) reward to the agent. The agent has no control in choosing the actions, but has the option to end the entire process at any time. The goal of the planner is to find a policy that maximizes her cumulative reward, taking into consideration the agent's ability to terminate. We give a fully polynomialtime approximation scheme for this problem. En route, we present polynomialtime algorithms for computing (exact) optimal policies for important special cases of this problem, including when the time horizon is constant, or when the MDP exhibits a "definitive decisions" property. We illustrate our algorithms with two different gametheoretic applications: the problem of assigning rides in ridesharing and the problem of designing screening policies. Our results imply efficient algorithms for computing (approximately) optimal policies in both applications.more » « less

Guruswami, Venkatesan (Ed.)We study two recent combinatorial contract design models, which highlight different sources of complexity that may arise in contract design, where a principal delegates the execution of a costly project to others. In both settings, the principal cannot observe the choices of the agent(s), only the project’s outcome (success or failure), and incentivizes the agent(s) using a contract, a payment scheme that specifies the payment to the agent(s) upon a project’s success. We present results that resolve open problems and advance our understanding of the computational complexity of both settings. In the multiagent setting, the project is delegated to a team of agents, where each agent chooses whether or not to exert effort. A success probability function maps any subset of agents who exert effort to a probability of the project’s success. For the family of submodular success probability functions, Dütting et al. [2023] established a polytime constant factor approximation to the optimal contract, and left open whether this problem admits a PTAS. We answer this question on the negative, by showing that no polytime algorithm guarantees a better than 0.7approximation to the optimal contract. For XOS functions, they give a polytime constant approximation with value and demand queries. We show that with value queries only, one cannot get any constant approximation. In the multiaction setting, the project is delegated to a single agent, who can take any subset of a given set of actions. Here, a success probability function maps any subset of actions to a probability of the project’s success. Dütting et al. [2021a] showed a polytime algorithm for computing an optimal contract for gross substitutes success probability functions, and showed that the problem is NPhard for submodular functions. We further strengthen this hardness result by showing that this problem does not admit any constant factor approximation. Furthermore, for the broader class of XOS functions, we establish the hardness of obtaining a n^{1/2+ε}approximation for any ε > 0.more » « less

In many realworld reinforcement learning (RL) problems, in addition to maximizing the objective, the learning agent has to maintain some necessary safety constraints. We formulate the problem of learning a safe policy as an infinitehorizon discounted Constrained Markov Decision Process (CMDP) with an unknown transition probability matrix, where the safety requirements are modeled as constraints on expected cumulative costs. We propose two modelbased constrained reinforcement learning (CRL) algorithms for learning a safe policy, namely, (i) GMCRL algorithm, where the algorithm has access to a generative model, and (ii) UCCRL algorithm, where the algorithm learns the model using an upper confidence style online exploration method. We characterize the sample complexity of these algorithms, i.e., the the number of samples needed to ensure a desired level of accuracy with high probability, both with respect to objective maximization and constraint satisfaction.more » « less

null (Ed.)In decisionmaking problems, the actions of an agent may reveal sensitive information that drives its decisions. For instance, a corporation’s investment decisions may reveal its sensitive knowledge about market dynamics. To prevent this type of information leakage, we introduce a policy synthesis algorithm that protects the privacy of the transition probabilities in a Markov decision process. We use differential privacy as the mathematical definition of privacy. The algorithm first perturbs the transition probabilities using a mechanism that provides differential privacy. Then, based on the privatized transition probabilities, we synthesize a policy using dynamic programming. Our main contribution is to bound the "cost of privacy," i.e., the difference between the expected total rewards with privacy and the expected total rewards without privacy. We also show that computing the cost of privacy has time complexity that is polynomial in the parameters of the problem. Moreover, we establish that the cost of privacy increases with the strength of differential privacy protections, and we quantify this increase. Finally, numerical experiments on two example environments validate the established relationship between the cost of privacy and the strength of data privacy protections.more » « less

In many realworld reinforcement learning (RL) problems, in addition to maximizing the objective, the learning agent has to maintain some necessary safety constraints. We formulate the problem of learning a safe policy as an infinitehorizon discounted Constrained Markov Decision Process (CMDP) with an unknown transition probability matrix, where the safety requirements are modeled as constraints on expected cumulative costs. We propose two modelbased constrained reinforcement learning (CRL) algorithms for learning a safe policy, namely, (i) GMCRL algorithm, where the algorithm has access to a generative model, and (ii) UCCRL algorithm, where the algorithm learns the model using an upper confidence style online exploration method. We characterize the sample complexity of these algorithms, i.e., the the number of samples needed to ensure a desired level of accuracy with high probability, both with respect to objective maximization and constraint satisfaction.