We study the problem of learning personalized decision policies from observational data while accounting for possible unobserved confounding in the datagenerating process. Unlike previous approaches that assume unconfoundedness, i.e., no unobserved confounders affected both treatment assignment and outcomes, we calibrate policy learning for realistic violations of this unverifiable assumption with uncertainty sets motivated by sensitivity analysis in causal inference. Our framework for confoundingrobust policy improvement optimizes the minimax regret of a candidate policy against a baseline or reference "status quo" policy, over an uncertainty set around nominal propensity weights. We prove that if the uncertainty set is wellspecified, robust policy learning can do no worse than the baseline, and only improve if the data supports it. We characterize the adversarial subproblem and use efficient algorithmic solutions to optimize over parametrized spaces of decision policies such as logistic treatment assignment. We assess our methods on synthetic data and a large clinical trial, demonstrating that confounded selection can hinder policy learning and lead to unwarranted harm, while our robust approach guarantees safety and focuses on wellevidenced improvement.
Sample Complexity of Robust Reinforcement Learning with a Generative Model
The Robust Markov Decision Process (RMDP) framework focuses on designing control policies that are robust against the parameter uncertainties due to the mis matches between the simulator model and realworld settings. An RMDP problem is typically formulated as a maxmin problem, where the objective is to find the policy that maximizes the value function for the worst possible model that lies in an uncertainty set around a nominal model. The standard robust dynamic programming approach requires the knowledge of the nominal model for computing the optimal robust policy. In this work, we propose a modelbased reinforcement learning (RL) algorithm for learning an εoptimal robust policy when the nominal model is unknown. We consider three different forms of uncertainty sets, characterized by the total variation distance, chisquare divergence, and KL divergence. For each of these uncertainty sets, we give a precise characterization of the sample complexity of our proposed algorithm. In addition to the sample complexity results, we also present a formal analytical argument on the benefit of using robust policies. Finally, we demonstrate the performance of our algorithm on two benchmark problems.
 Publication Date:
 NSFPAR ID:
 10327542
 Journal Name:
 International Conference on Artificial Intelligence and Statistics (AISTATS)
 Page Range or eLocationID:
 95829602
 Sponsoring Org:
 National Science Foundation
More Like this


Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand how the difficulty of the problem scales as the horizon increases. Here the natural measure of sample complexity is a normalized one: we are interested in the \emph{number of episodes} it takes to provably discover a policy whose value is eps near to that of the optimal value, where the value is measured by the \emph{normalized} cumulative reward in each episode. In a COLT 2018 open problem, Jiang and Agarwal conjectured that, for tabular, episodic reinforcement learning problems, there exists a sample complexity lower bound which exhibits a polynomial dependence on the horizon  a conjecture which is consistent with all known sample complexity upper bounds. This work refutes this conjecture, proving that tabular, episodic reinforcement learning is possible with a sample complexity that scales only \emph{logarithmically} with the planning horizon. In other words, when the values are appropriately normalized (to lie in the unit interval), this results shows that long horizon RL is no more difficult than short horizon RL, at least in a minimax sense. Our analysis introduces two ideas: (i) the construction of an epsnet formore »

Detection of malicious behavior is a fundamental problem in security. One of the major challenges in using detection systems in practice is in dealing with an overwhelming number of alerts that are triggered by normal behavior (the socalled false positives), obscuring alerts resulting from actual malicious activity. While numerous methods for reducing the scope of this issue have been proposed, ultimately one must still decide how to prioritize which alerts to investigate, and most existing prioritization methods are heuristic, for example, based on suspiciousness or priority scores. We introduce a novel approach for computing a policy for prioritizing alerts using adversarial reinforcement learning. Our approach assumes that the attackers know the full state of the detection system and dynamically choose an optimal attack as a function of this state, as well as of the alert prioritization policy. The first step of our approach is to capture the interaction between the defender and attacker in a game theoretic model. To tackle the computational complexity of solving this game to obtain a dynamic stochastic alert prioritization policy, we propose an adversarial reinforcement learning framework. In this framework, we use neural reinforcement learning to compute best response policies for both the defender andmore »

Modelbased reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multiagent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the nonstationarity problem when all agents are improving their policies simultaneously using samples. Though intuitive and widelyused, the sample complexity of modelbased MARL algorithms has been investigated relatively much less often. In this paper, we aim to address the fundamental open question about the sample complexity of modelbased MARL. We study arguably the most basic MARL setting: twoplayer discounted zerosum Markov games, given only access to a generative model of state transition. We show that modelbased MARL achieves a near optimal sample complexity for finding the Nash equilibrium (NE) \emph{value} up to some additive error. We also show that this method is nearminimax optimal with a tight dependence on the horizon and the number of states. Our results justify the efficiency of this simple modelbased approach in the multiagent RL setting.

Cuttingplane methods have enabled remarkable successes in integer programming over the last few decades. Stateoftheart solvers integrate a myriad of cuttingplane techniques to speed up the underlying treesearch algorithm used to find optimal solutions. In this paper we prove the first guarantees for learning highperforming cutselection policies tailored to the instance distribution at hand using samples. We first bound the sample complexity of learning cutting planes from the canonical family of ChvátalGomory cuts. Our bounds handle any number of waves of any number of cuts and are fine tuned to the magnitudes of the constraint coefficients. Next, we prove sample complexity bounds for more sophisticated cut selection policies that use a combination of scoring rules to choose from a family of cuts. Finally, beyond the realm of cutting planes for integer programming, we develop a general abstraction of tree search that captures key components such as node selection and variable selection. For this abstraction, we bound the sample complexity of learning a good policy for building the search tree.