skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 11:00 PM ET on Friday, July 11 until 2:00 AM ET on Saturday, July 12 due to maintenance. We apologize for the inconvenience.


Title: Robustly Improving Bandit Algorithms with Confounded and Selection Biased Offline Data: A Causal Approach
This paper studies bandit problems where an agent has access to offline data that might be utilized to potentially improve the estimation of each arm’s reward distribution. A major obstacle in this setting is the existence of compound biases from the observational data. Ignoring these biases and blindly fitting a model with the biased data could even negatively affect the online learning phase. In this work, we formulate this problem from a causal perspective. First, we categorize the biases into confounding bias and selection bias based on the causal structure they imply. Next, we extract the causal bound for each arm that is robust towards compound biases from biased observational data. The derived bounds contain theground truth mean reward and can effectively guide the bandit agent to learn a nearly-optimal decision policy. We also conduct regret analysis in both contextual and non-contextual bandit settings and show that prior causal bounds could helpconsistently reduce the asymptotic regret.  more » « less
Award ID(s):
1910284 1940093 2137335 2147375
PAR ID:
10527564
Author(s) / Creator(s):
;
Publisher / Repository:
AAAI
Date Published:
Journal Name:
Proceedings of the AAAI Conference on Artificial Intelligence
Volume:
38
Issue:
18
ISSN:
2159-5399
Page Range / eLocation ID:
20438 to 20446
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Data that is gathered adaptively --- via bandit algorithms, for example --- exhibits bias. This is true both when gathering simple numeric valued data --- the empirical means kept track of by stochastic bandit algorithms are biased downwards --- and when gathering more complicated data --- running hypothesis tests on complex data gathered via contextual bandit algorithms leads to false discovery. In this paper, we show that this problem is mitigated if the data collection procedure is differentially private. This lets us both bound the bias of simple numeric valued quantities (like the empirical means of stochastic bandit algorithms), and correct the p-values of hypothesis tests run on the adaptively gathered data. Moreover, there exist differentially private bandit algorithms with near optimal regret bounds: we apply existing theorems in the simple stochastic case, and give a new analysis for linear contextual bandits. We complement our theoretical results with experiments validating our theory. 
    more » « less
  2. We propose a novel formulation of group fairness with biased feedback in the contextual multi-armed bandit (CMAB) setting. In the CMAB setting, a sequential decision maker must, at each time step, choose an arm to pull from a finite set of arms after observing some context for each of the potential arm pulls. In our model, arms are partitioned into two or more sensitive groups based on some protected feature(s) (e.g., age, race, or socio-economic status). Initial rewards received from pulling an arm may be distorted due to some unknown societal or measurement bias. We assume that in reality these groups are equal despite the biased feedback received by the agent. To alleviate this, we learn a societal bias term which can be used to both find the source of bias and to potentially fix the problem outside of the algorithm. We provide a novel algorithm that can accommodate this notion of fairness for an arbitrary number of groups, and provide a theoretical bound on the regret for our algorithm. We validate our algorithm using synthetic data and two real-world datasets for intervention settings wherein we want to allocate resources fairly across groups. 
    more » « less
  3. We study a nonparametric contextual bandit problem in which the expected reward functions belong to a Hölder class with smoothness parameter β. We show how this interpolates between two extremes that were previously studied in isolation: nondifferentiable bandits (β at most 1), with which rate-optimal regret is achieved by running separate noncontextual bandits in different context regions, and parametric-response bandits (infinite [Formula: see text]), with which rate-optimal regret can be achieved with minimal or no exploration because of infinite extrapolatability. We develop a novel algorithm that carefully adjusts to all smoothness settings, and we prove its regret is rate-optimal by establishing matching upper and lower bounds, recovering the existing results at the two extremes. In this sense, our work bridges the gap between the existing literature on parametric and nondifferentiable contextual bandit problems and between bandit algorithms that exclusively use global or local information, shedding light on the crucial interplay of complexity and regret in contextual bandits. 
    more » « less
  4. We study a variant of the contextual bandit problem where an agent can intervene through a set of stochastic expert policies. Given a fixed context, each expert samples actions from a fixed conditional distribution. The agent seeks to remain competitive with the “best” among the given set of experts. We propose the Divergence-based Upper Confidence Bound (D-UCB) algorithm that uses importance sampling to share information across experts and provide horizon-independent constant regret bounds that only scale linearly in the number of experts. We also provide the Empirical D-UCB (ED-UCB) algorithm that can function with only approximate knowledge of expert distributions. Further, we investigate the episodic setting where the agent interacts with an environment that changes over episodes. Each episode can have different context and reward distributions resulting in the best expert changing across episodes. We show that by bootstrapping from\(\mathcal {O}(N\log (NT^2\sqrt {E}))\)samples, ED-UCB guarantees a regret that scales as\(\mathcal {O}(E(N+1) + \frac{N\sqrt {E}}{T^2})\)forNexperts overEepisodes, each of lengthT. We finally empirically validate our findings through simulations. 
    more » « less
  5. In this paper, we study Federated Bandit, a decentralized Multi-Armed Bandit problem with a set of N agents, who can only communicate their local data with neighbors described by a connected graph G. Each agent makes a sequence of decisions on selecting an arm from M candidates, yet they only have access to local and potentially biased feedback/evaluation of the true reward for each action taken. Learning only locally will lead agents to sub-optimal actions while converging to a no-regret strategy requires a collection of distributed data. Motivated by the proposal of federated learning, we aim for a solution with which agents will never share their local observations with a central entity, and will be allowed to only share a private copy of his/her own information with their neighbors. We first propose a decentralized bandit algorithm \textttGossip\_UCB, which is a coupling of variants of both the classical gossiping algorithm and the celebrated Upper Confidence Bound (UCB) bandit algorithm. We show that \textttGossip\_UCB successfully adapts local bandit learning into a global gossiping process for sharing information among connected agents, and achieves guaranteed regret at the order of O(\max\ \textttpoly (N,M) łog T, \textttpoly (N,M)łog_łambda_2^-1 N\ ) for all N agents, where łambda_2\in(0,1) is the second largest eigenvalue of the expected gossip matrix, which is a function of G. We then propose \textttFed\_UCB, a differentially private version of \textttGossip\_UCB, in which the agents preserve ε-differential privacy of their local data while achieving O(\max \\frac\textttpoly (N,M) ε łog^2.5 T, \textttpoly (N,M) (łog_łambda_2^-1 N + łog T) \ ) regret. 
    more » « less