- Award ID(s):
- 1704908
- PAR ID:
- 10111018
- Date Published:
- Journal Name:
- Advances in Neural Information Processing Systems 31
- Volume:
- 31
- Page Range / eLocation ID:
- 2568--2578
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Causal knowledge is sought after throughout data-driven fields due to its explanatory power and potential value to inform decision-making. If the targeted system is well-understood in terms of its causal components, one is able to design more precise and surgical interventions so as to bring certain desired outcomes about. The idea of leveraging the causal understand- ing of a system to improve decision-making has been studied in the literature under the rubric of structural causal bandits (Lee and Bareinboim, 2018). In this setting, (1) pulling an arm corresponds to performing a causal intervention on a set of variables, while (2) the associated rewards are governed by the underlying causal mechanisms. One key assumption of this work is that any observed variable (X) in the system is manipulable, which means that intervening and making do(X = x) is always realizable. In many real-world scenarios, however, this is a too stringent requirement. For instance, while scientific evidence may support that obesity shortens life, it’s not feasible to manipulate obesity directly, but, for example, by decreasing the amount of soda consumption (Pearl, 2018). In this paper, we study a relaxed version of the structural causal bandit problem when not all variables are manipulable. Specifically, we develop a procedure that takes as argument partially specified causal knowledge and identifies the possibly-optimal arms in structural bandits with non-manipulable variables. We further introduce an algorithm that uncovers non-trivial dependence structure among the possibly-optimal arms. Finally, we corroborate our findings with simulations, which shows that MAB solvers enhanced with causal knowledge and leveraging the newly discovered dependence structure among arms consistently outperform their causal-insensitive counterparts.more » « less
-
Restless multi-armed bandits (RMAB) have been widely used to model sequential decision making problems with constraints. The decision maker (DM) aims to maximize the expected total reward over an infinite horizon under an “instantaneous activation constraint” that at most B arms can be activated at any decision epoch, where the state of each arm evolves stochastically according to a Markov decision process (MDP). However, this basic model fails to provide any fairness guarantee among arms. In this paper, we introduce RMAB-F, a new RMAB model with “long-term fairness constraints”, where the objective now is to maximize the longterm reward while a minimum long-term activation fraction for each arm must be satisfied. For the online RMAB-F setting (i.e., the underlying MDPs associated with each arm are unknown to the DM), we develop a novel reinforcement learning (RL) algorithm named Fair-UCRL. We prove that Fair-UCRL ensures probabilistic sublinear bounds on both the reward regret and the fairness violation regret. Compared with off-the-shelf RL methods, our Fair-UCRL is much more computationally efficient since it contains a novel exploitation that leverages a low-complexity index policy for making decisions. Experimental results further demonstrate the effectiveness of our Fair-UCRL.
-
This paper studies bandit problems where an agent has access to offline data that might be utilized to potentially improve the estimation of each arm’s reward distribution. A major obstacle in this setting is the existence of compound biases from the observational data. Ignoring these biases and blindly fitting a model with the biased data could even negatively affect the online learning phase. In this work, we formulate this problem from a causal perspective. First, we categorize the biases into confounding bias and selection bias based on the causal structure they imply. Next, we extract the causal bound for each arm that is robust towards compound biases from biased observational data. The derived bounds contain theground truth mean reward and can effectively guide the bandit agent to learn a nearly-optimal decision policy. We also conduct regret analysis in both contextual and non-contextual bandit settings and show that prior causal bounds could helpconsistently reduce the asymptotic regret.
-
Abstract Many real-world decision-making tasks require learning causal relationships between a set of variables. Traditional causal discovery methods, however, require that all variables are observed, which is often not feasible in practical scenarios. Without additional assumptions about the unobserved variables, it is not possible to recover any causal relationships from observational data. Fortunately, in many applied settings, additional structure among the confounders can be expected. In particular, pervasive confounding is commonly encountered and has been utilised for consistent causal estimation in linear causal models. In this article, we present a provably consistent method to estimate causal relationships in the nonlinear, pervasive confounding setting. The core of our procedure relies on the ability to estimate the confounding variation through a simple spectral decomposition of the observed data matrix. We derive a DAG score function based on this insight, prove its consistency in recovering a correct ordering of the DAG, and empirically compare it to previous approaches. We demonstrate improved performance on both simulated and real datasets by explicitly accounting for both confounders and nonlinear effects.
-
In many real-world settings, a decision-maker must combine information provided by different experts in order to decide on an effective policy. Alrajeh, Chockler, and Halpern showed how to combine causal models that are compatible in the sense that, for variables that appear in both models, the experts agree on the causal structure. In this work we show how causal models can be combined in cases where the experts might disagree on the causal structure for variables that appear in both models due to having different focus areas. We provide a new formal definition of compatibility of models in this setting and show how compatible models can be combined. We also consider the complexity of determining whether models are compatible. We believe that the notions defined in this work are of direct relevance to many practical decision making scenarios that come up in natural, social, and medical science settings.more » « less