<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Causal Logistic Bandits with Counterfactual Fairness Constraints</title></titleStmt>
			<publicationStmt>
				<publisher>The 42nd International Conference on Machine Learning</publisher>
				<date>07/19/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10614745</idno>
					<idno type="doi"></idno>
					
					<author>Jiajun Chen</author><author>Jin Tian</author><author>Christopher J Quinn</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Artificial intelligence will play a significant role in decision making in numerous aspects of society. Numerous fairness criteria have been proposed in the machine learning community, but there remains limited investigation into fairness as defined through specified attributes in a sequential decision-making framework. In this paper, we focus on causal logistic bandit problems where the learner seeks to make fair decisions, under a notion of fairness that accounts for counterfactual reasoning. We propose and analyze an algorithm by leveraging primal-dual optimization for constrained causal logistic bandits where the non-linear constraints are a priori unknown and must be learned in time. We obtain sub-linear regret guarantees with leading term similar to that for unconstrained logistic bandits (Lee et al., 2024) while guaranteeing sub-linear constraint violations. We show how to achieve zero cumulative constraint violations with a small increase in the regret bound.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Artificial intelligence (AI) models, using techniques from statistics and machine learning, are increasingly being used to make affect people's lives. In light of this, a plethora of formal fairness criteria have been proposed <ref type="bibr">(Darlington, 1971;</ref><ref type="bibr">Dwork et al., 2012;</ref><ref type="bibr">Hardt et al., 2016;</ref><ref type="bibr">Zhang et al., 2016;</ref><ref type="bibr">Kusner et al., 2017;</ref><ref type="bibr">Zafar et al., 2017;</ref><ref type="bibr">Nabi &amp; Shpitser, 2018;</ref><ref type="bibr">Chiappa, 2019;</ref><ref type="bibr">Chouldechova &amp; Roth, 2020;</ref><ref type="bibr">Imai &amp; Jiang, 2023;</ref><ref type="bibr">Plecko &amp; Bareinboim, 2024)</ref>. There has been growing interest in the sequential decision-making community for accounting for fairness, including in settings such as classic and contextual bandits <ref type="bibr">(Joseph et al., 2018)</ref>, combinatorial bandits <ref type="bibr">(Xu et al., 2020)</ref>, bandits with long-term constraints <ref type="bibr">(Liu et al., 2022)</ref>, and reinforcement learning <ref type="bibr">(Jabbari et al., 2017)</ref>, just to name a few. Notably, rather than addressing fairness through the lens of specified attributes, these studies typically operationalize fairness in a different manner by defining it with respect to one-step rewards and introducing a notion of meritocratic fairness <ref type="bibr">(Joseph et al., 2018)</ref>. An algorithm should never assign a higher selection probability to a less qualified decision than to a more qualified one, i.e., arms with higher empirical rewards should be picked more frequently than those with lower empirical rewards, which is distinguishable from the fairness criteria based on specified attribute.</p><p>In this paper, we focus on a problem structure wherein arms arrive in a sequential and stochastic manner from an underlying fixed distribution and decisions are made in an online fashion by the agent. The objective of the agent is to optimize cumulative rewards while achieving fairness counterfactually with respect to specified attributes, i.e. the outcome would not have been substantially different if the specified attributes had different values. In general, this type of task belongs to the setting of dynamic treatment regimes <ref type="bibr">(Murphy, 2003;</ref><ref type="bibr">Lavori &amp; Dawson, 2008;</ref><ref type="bibr">Zhang, 2020)</ref> for finding a sequence of decisions over a finite set of treatments which appears across a broad range of applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Our Contributions</head><p>In light of the above, the goal of this paper is to analyze the foundations of online causal fair decision-making. More specially, our contributions are as follows:</p><p>&#8226; We formulate a constrained causal logistic bandits problem where the online decision-making processes are characterized within a causal structure. We formalize a (non-linear) fairness constraint based on the counterfactual outcome effect that is a priori unknown and must be learned in time. To the best of our knowledge, this is the first work to study constrained logistic bandits without a known safe decision subset (see Footnote 1).</p><p>&#8226; We provide an unified analysis for the confidence set construction, algorithm design, and performance guarantee, i.e., sublinear reward regret and sublinear cumulative constraint violations by leveraging the regret-toconfidence-set conversion and the primal-dual online Table <ref type="table">1</ref>: Comparison of reward model, constraint types, frequentist regret guarantees, and cumulative violations upper bounds for select related works. Notation: horizon T , the dimension for arm feature vector n, bounded bandit parameter S, truncated parameter &#961; (see Section 3.2), Slater's constant &#948; (see Assumption 3), decision-set-dependent term R Z (T ), generalized linear model (GLM). Problem dependent constant &#954; * , &#954; Z , &#954;, where 1/&#954; * &lt; 1 &#8810; &#954; Z &#8810; &#954; when compared within the same decision and parameter spaces. &#8224; Requires prior knowledge (see Footnote 1) of a safe action/policy (i.e. satisfies the constraint &#8710; &#960;0 &#8804; &#964; ) (per-round zero constraint violations with high probability).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm</head><p>Model Constraint Regret Violation Safe UCB-GLM <ref type="bibr">(Amani et al., 2020)</ref> GLM GLM O &#954;n &#8730; T 0 &#8224; OFULog-r <ref type="bibr">(Abeille et al., 2021)</ref> Logistic No</p><p>O nS 5 2 T &#954; * (T ) + min{n 2 S 4 &#954;Z , RZ (T )} -F-UCB (Huang et al., 2022b) Causal MAB Linear O 1 &#964; -&#8710;&#960; 0 |W |T 0 &#8224; OFULog+ (Lee et al., 2024) Logistic No O nS T &#954; * (T ) + min{n 2 S 2 &#954;Z , RZ (T )} -CCLB Theorem 1 Causal logistic Mixture logistics</p><p>Causal logistic</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Mixture logistics</head><p>O &#961;nS &#8730; T + &#961;n 2 S 2 &#954;Z + &#961; &#8730; T (1 + &#948;) 2 0 optimization. We show that the leading term of our regret, &#213;(&#961;nS &#8730; T ), is significantly better than regret bounds for related works handling constraints and is similar to the bound for the unconstrained problem <ref type="bibr">(Lee et al., 2024)</ref> (see Remark 2). Furthermore, by introducing an user-chosen parameter, one can trade off the regret slightly to achieve zero cumulative constraint violations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Related Work</head><p>We next briefly discuss two lines of literature closely related to our work. See Appendix B for more discussion on additional works.</p><p>Firstly, in terms of formulating causal fairness within a multi-armed bandit setting, the closest related work is <ref type="bibr">(Huang et al., 2022b)</ref>. Like us, they considered a stochastic, contextual MAB problem with a (known) causal graph governing relationships between the stochastic contexts (seen by the learner before making decisions) and the rewards. They assume all variables are discrete. Like us, they proposed characterizing fairness with counterfactual fairness <ref type="bibr">(Kusner et al., 2017;</ref><ref type="bibr">Wu et al., 2019;</ref><ref type="bibr">Chiappa, 2019)</ref> w.r.t. specified attributes in the context (e.g. specified user features in an online recommendation system). They make an assumption 1 about a fair policy; they provide high probabil-1 <ref type="bibr">Pacchiano et al. (2021)</ref>  <ref type="bibr">(which Huang et al. (2022b)</ref>'s analysis is based on) requires explicit a priori knowledge of a feasible action/policy (Assumption 5) and states that it is "absolutely necessary" to do so for the problem they study (Remark 1). <ref type="bibr">Huang et al. (2022b)</ref>'s Assumption 3 only requires the existence of a safe policy &#960;0; &#960;0 is not explicitly used in estimating rewards or estimating a ity guarantees that all actions are fair. We model fairness as a long-term constraint, for which we seek to bound cumulative violations, as it is unclear whether it is possible to certify policies as fair (feasible) before collecting data to estimate the reward parameter upon which the (non-linear) constraint depends. Unlike our work, they considered that all variables except the reward are discrete-valued with non-parametric (thus more flexible) distributions. They proposed simpler empirical estimation methods for rewards, for which the counterfactual constraints became linear. While the structural causal model was discrete-valued but non-parametric, their regret bound in turn depended on |W |, the number of realizations of the set of parent variables of the reward, which is exponential in the size of the parent set (see Table <ref type="table">1</ref>). In contrast, we model rewards parametrically (using a logistic model), depending on feature maps of the context and decision variables. This dramatically improves the dimensional dependence, though the fairness constraint becomes a mixture of logistic functions for which estimating confidence bounds (to estimate region of fair actions) is more challenging.</p><p>Among works on MAB with parametric rewards and unknown (stochastic) constraints, there are numerous works on logistic rewards without constraints and linear rewards with linear constraints (see Appendix B for discussion on those works). The only prior work that like us considered a non-linear (parametric) reward model with non-linear unset of feasible policies in the main paper. However, to the best of our knowledge it is unclear how the conservatively estimated sets of policies &#934;t (shown w.h.p. to be feasible for all rounds) that are used to select actions could be guaranteed to be non-empty in early rounds without a known safe policy &#960;0 or additional assumptions. known (stochastic) constraints is <ref type="bibr">(Amani et al., 2020)</ref>. 
They considered generalized linear rewards where the generalized function is assumed to be twice-differentiable and Lipschitz constant of which logistic rewards is a sub-class. We note that in terms of regret bound alone, their bound specialized to logistic rewards is linearly dependent on the worst case parameter &#954; (see Table <ref type="table">1</ref>), which can be arbitrarily large. They considered generalized linear constraints; like in our work, the constraints depend on the unknown parameter vector &#952; * in the reward function. However, they consider a priori knowledge of some feasible actions. At a high level, they explore the environment (improving their estimate of &#952; * ) using those feasible actions and are able to get high probability guarantees of per-round feasibility. We do not assume such prior knowledge. We instead bound long-term constraint violations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3.">Preliminaries</head><p>In this section, we introduce the basic notations and definitions used throughout the paper. We use capital letters to denote variables (e.g., Y ), lowercase letters (e.g., y) represent scalar values, bold lowercase letters (e.g., y) indicate vectors, and bold uppercase letters (e.g., Y) represent matrices. For a twice-differentiable function g, the notation &#289; and g denote the first and second derivative of function g respectively. For a random variable Z, let Z represent the domain of Z and |Z| the latter's dimension. For two realvalued symmetric matrices A and B, the notation A &#10928; B indicates that A -B is positive semi-definite, and when A is positive definite, we denote A-norm for a vector z as &#8741;z&#8741; A = &#8730; z &#8890; Az. Finally, for two univariate real-valued functions f and g, we denote f = O(g) to indicate that g dominates f up to logarithmic factors; and for an event E &#8712; &#8486;, we write 1{E} the indicator function of E.</p><p>We adopt the language of Structural Causal Models (SCMs) <ref type="bibr">(Pearl, 2009, Ch. 7</ref>). An SCM M is a tuple &#10216;U, V, F , P(u)&#10217;, where U is a set of exogenous (unobserved or latent) variables, V is a set of endogenous (observed) variables, F is a set of structural functions, and P(u) is a distribution over the latent variables. For the set of structural functions F , f Vi &#8712; F decides values of an endogenous variable V i &#8712; V taking as argument a combination of other variables. That is,</p><p>, where P a Vi denotes the parent set (explained below) of V i . Realizations of the set of latent variables U &#8764; P(u) induce an observational distribution P(v) over V . An intervention on a variable</p><p>Each SCM is associated with a directed acyclic graph (DAG) G (e.g., see Figure <ref type="figure">1</ref>), called the causal diagram, where nodes correspond to endogenous variables V, solid arrows represent arguments of each function f V . A bi-directed arrow between nodes V i and V j indicates an unobserved confounder affecting both V i and V j , i.e., U Vi &#8745; U Vj &#824; = &#8709;. We will use the graphtheoretic family abbreviations, e.g., P a(V ) G stand for the set of parents of V in G. Two nodes X and Y are said to be d-separated by a third set Z in a DAG G denoted by (X &#8869; &#8869; Y |Z) G if and only if Z blocks all paths from every node in X to every node in Y . The criterion of blockage follows <ref type="bibr">(Pearl, 2009, Def. 1.2.3)</ref>, included in Appendix A with formal definitions for completeness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">A Theoretical Framework for Constrained</head><p>Causal Logistic Bandits</p><p>In this section, we formalize the constrained causal logistic bandits theoretical framework in the semantics of SCMs and MABs. We start by considering a recruitment example (see Appendix D.1 for more motivating examples), where the decision-making process is characterized by the extended Standard Fairness Model (SFM) (Zhang &amp; Bareinboim, 2018; Plecko &amp; Bareinboim, 2024). See Figure 1 for a graphical model of the SFM. Variable A represents the specified attribute, W is a set of confounded features, and M is a set of intermediate features, D and Y represent the decision and outcome reward. The contextual information {w t , m t , a t } is accessible by the learner before making decisions. 2.1. Logistic bandits with structural causal model At every round t, the learner observes the contextual features {w t , m t , a t }, which are drawn from a stochastic distribution and then is presented a set of decisions D t that depend on the candidate's context. The learner chooses a decision d t &#8712; D t and receives an outcome reward y t+1 &#8712; {0, 1}. The learner's decision is based on previous round knowledge F t = (F 0 , {w t , m t , a t , d t , y t+1 } t-1 t=1</p><p>) and causal information, where F 0 represents any prior knowledge. In our problem, we assume that the outcome variable Y has a generalized linear relationship <ref type="bibr">(Filippi et al., 2010;</ref><ref type="bibr">Li et al., 2017)</ref> with the features Z, specifically,</p><p>where the fixed but unknown parameters &#952; * belongs to R n , g(x) = (1 + e -x ) -1 is the standard logistic function, f is the mapping function that is known ahead of time to the learner, and the encoded feature vector f (Z) is in R n . Then the interventional distribution for the expected reward of do(d t ) and do(a t ) given the observed contextual features m t and w t is represented as <ref type="bibr">(Plecko &amp; Bareinboim, 2024)</ref>:</p><p>where Z t is the feature consisting of the decision d t and the contexts {w t , m t , a t }, In this paper, we consider one specified attribute variable A, which in general is a parent of the decision and outcome variable in the causal graph (see Figure <ref type="figure">1</ref>). Note that the specified attribute value at round t is a t and we denote the counterfactual value as a &#8242; t . Both the decision and (hypothetical) counterfactual intervention on the specified attribute value are atomic interventions <ref type="bibr">(Correa &amp; Bareinboim, 2020)</ref>. Thus, the expected reward for the counterfactual feature a &#8242; t for any decision d</p><p>t is the counterfactual feature for the interventions do(a &#8242; t ) and do(d). Notice that for this problem, we consider that the factual feature Z t and counterfactual feature Z a &#8242; t are both in the feature space Z. The derivation of Equation (2) and Equation (3) follows by the do-calculus rule <ref type="bibr">(Pearl, 1995)</ref>; readers can refer Appendix D.2 for a more detailed analysis. Therefore, the counterfactual fairness effect for decision d t is represented as:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Counterfactual fairness modeling via soft constraint</head><p>In this section, we discuss modeling fairness as part of the learner's decision making problem. Consider a stochastic bandit optimization with a soft constraint for our decisionmaking problem. In particular, at every round t &#8712; [T ], the learner selects a decision d t to maximize the expected reward E[Y |do(d t ), do(a t ), w t , m t ] subject to a constraint on (violations to) counterfactual fairness (Equation (4)),</p><p>where &#964; is a predefined fairness threshold. In this work, the counterfactual fairness constraint by Equation (5) requires that the expected reward is similar regardless if the value of the specified attribute had been different. In addition, the learner only receives bandit feedback (the reward). The learner does not not observe feedback on constraint violations. <ref type="bibr">Huang et al. (2022b)</ref> were the first to propose a counterfactual fairness constraint in a bandit framework. We note that their setting confines the learner to decisions from a safe action set (see Footnote 1). To the best of our knowledge, that setting requires strong assumptions on prior knowledge of a subset of safe actions that can be used even before rewards are estimated (the constraint (4) depends on the unknown reward parameter vector &#952; * ). Prior knowledge of safe actions can be mild in some settings, though we argue counterfactual fairness (a convex combination of logistic functions that depend on the unknown reward parameter vector &#952; * ) is more complex, thereby it is less obvious for us to construct a prior safe decision without knowing any information about the reward distribution. Therefore, for the setting we consider, since no safe actions might known a priori, we allow for instantaneous violations but bound the cumulative violations.</p><p>The goal of the learner is to maximize the cumulative expected outcome reward while minimizing the cumulative expected counterfactual fairness constraint violations throughout the learning process. Define the cumulative expected regret and cumulative expected counterfactual fairness constraint violations as</p><p>where</p><p>In this paper, we establish a stronger version of regret <ref type="bibr">(Liu et al., 2021;</ref><ref type="bibr">Zhou &amp; Ji, 2022)</ref>, specifically, let &#960; t be a probability distribution over the set of actions D t at round t, and let</p><p>We compare the received outcome reward with the following baseline optimization problem:</p><p>t is the optimal solution at step t. Thus, the stronger regret is defined as:</p><p>Note that the probability distribution &#960; t could include some decisions that violate the constraint but on average the constraint is satisfied, while for a single action it must be a feasible one, therefore, R + (T ) &#8805; R(T ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Model assumptions and definition</head><p>To study our constrained causal logistic bandits problem, we make the following standard assumptions <ref type="bibr">(Yu et al., 2017;</ref><ref type="bibr">Efroni et al., 2020;</ref><ref type="bibr">Zhou &amp; Ji, 2022)</ref>. Let &#920; denote a compact set in R n . Let Z denote the feature space domain.</p><p>Assumption 1 (Bounded Bandit Parameter). There is a known bound S on the norm of the (unknown) reward parameter vector &#952;, &#8741;&#952;&#8741; 2 &#8804; S, &#8704; &#952; &#8712; &#920;.</p><p>Assumption 2. The feature mapping function f : Z &#8594; R n is in a reproducing kernel Hilbert space (RKHS) with a bounded norm (i.e., a measure of smoothness), such that &#8741;f (Z)&#8741; 2 &#8804; 1, &#8704; Z &#8712; Z.</p><p>Assumption 3 (Slater's Constraint Qualification). There is a constant &#948; &gt; 0 such that there exists a feasible probability distribution &#960; t,0 over decision set D t that satisfies</p><p>. Without loss of generality, we assume &#948; &#8804; 1.</p><p>Notice that this is a mild assumption since it only requires that one could find a stochastic policy &#960; t,0 under which the expected constraint violations will be strictly less than a negative value. Whereas for hard constraints <ref type="bibr">(Amani et al., 2019;</ref><ref type="bibr">Khezeli &amp; Bitar, 2020;</ref><ref type="bibr">Pacchiano et al., 2021)</ref>, they typically assume that the non-empty initial safe decision set which is stronger than the assumption of existence for a Slater's constant &#948; about the learner's knowledge.</p><p>We next define a problem dependent quantity that impacts learnability.</p><p>Definition 1 (Problem Dependent Constant<ref type="foot">foot_1</ref> ).</p><p>We recall the other problem dependent constants discussed in Table <ref type="table">1</ref>:</p><p>Notice that such problem dependent constants are defined through the first order of logistic function, which quantifies the level of non-linearity of plausible expected reward signals with different scales. In particular, &#954; can be significantly large even for reasonable logistic bandits problems. Readers can refer Section 2 of <ref type="bibr">(Faury et al., 2020)</ref> for a detailed discussion on the importance of this quantity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods for Constrained Causal Logistic Bandits</head><p>We next design an online algorithm for the constrained causal logistic bandits problem. We will then develop a unified analysis of regret and constraint violations with rigorous performance guarantees for our decision making strategy. Before proposing the algorithm, we first construct a convex confidence set for the reward parameter &#952; * using a regret-toconfidence set conversion <ref type="bibr">(Lee et al., 2024)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Convex confidence set</head><p>For logistic bandit problems, a natural way to estimate the reward parameter &#952; * given F t is to use maximum-likelihood estimation. We build on the works for the unconstrained problem <ref type="bibr">(Abeille et al., 2021;</ref><ref type="bibr">Lee et al., 2024)</ref>. At every round t, a reward value y t+1 is sampled from a Bernoulli distribution with expected value (or success probability) g(f (Z t ) &#8890; &#952; * ). The unregularized cumulative logistic loss can be written as:</p><p>The loss L t (&#952;) is a strongly convex function of &#952; <ref type="bibr">(Abeille et al., 2021;</ref><ref type="bibr">Lee et al., 2024)</ref>. The reward parameter is estimated using maximum likelihood estimation (MLE), defined as &#952;t = arg min ||&#952;||2&#8804;S L t (&#952;). For &#945; &#8712; (0, 1], we use the confidence set:</p><p>where &#946; t (&#945;) = 10n log( St 4n + e) + 2((e -2) + S) log 1 &#945; . Then the following proposition ensures that C t (&#945;) is a confidence set for &#952; * with high probability: Proposition 1 (Theorem 1 in <ref type="bibr">(Lee et al., 2024)</ref>).</p><p>The proof is provided in Appendix E. The proof uses the approach from the online logistic regression regret guarantee of <ref type="bibr">(Foster et al., 2018)</ref> without running the online learning algorithm explicitly. We notice that the radius of the convex confidence set in <ref type="bibr">(Abeille et al., 2021</ref>, Lemma 1) is around O( nS 3 log(t)), while the above tightened loss-based confidence set results in O( (n + S) log(t)), leading to an overall improvement in the factor of S, especially when S is large, e.g., S &#8805; |D t |.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Online learning algorithm</head><p>We consider a constrained stochastic causal logistic bandit over horizon T as described in Section 2.2. The objec-tive for the learner is to maximize the cumulative rewards while minimizing cumulative counterfactual fairness violations over time horizon T . To address the challenges on the unknown reward and the unknown counterfactual fairness constraint, we develop a Constrained Causal Logistic Bandits (CCLB) algorithm by leveraging the primal-dual optimization techniques.</p><p>The pseudo code for our CCLB algorithm is in Algorithm 1. At every round t, let the Lagrangian of the primal problem</p><p>and then the associated dual function is defined as q(&#981;) = max &#960;t L D (&#960; t , &#981;). Since both the reward and counterfactual fairness constraint depend on the unknown parameter &#952; * , we first estimate it through maximal likelihood estimation and construct a confidence set C t (&#945;) using the observed histories, i.e., feature vectors and rewards. The greedy procedure is based on the principle of optimism in the face of uncertainty (OFU) <ref type="bibr">(Auer et al., 2002;</ref><ref type="bibr">2008;</ref><ref type="bibr">Osband &amp; Van Roy, 2014)</ref>, where the optimistic estimate ( &#952;t ) is obtained by maximizing the expected reward across the confidence set C t (&#945;), however, we penalize the expected reward for the constraint violations when the greedy action (d t ) is picked by the learner over the decision domain D t . The dual update that minimizes q(&#981;) with respect to &#981; is by taking a projected gradient descent with 1/&#951; being the step size. Note that the truncated parameter &#961; is chosen to be larger than the optimal dual variable &#981; * , where can be achieved since the optimal dual variable is bounded under the Slater's constraint qualification, specifically, <ref type="bibr">d)</ref>, do(a t ), w t , m t ])/&#948; from Theorem 8.42 in <ref type="bibr">(Beck, 2017)</ref>.</p><p>Remark 1. We remark that the computational complexity of Algorithm 1 is the same as standard algorithms for unconstrained logistic bandits problems <ref type="bibr">(Abeille et al., 2021;</ref><ref type="bibr">Lee et al., 2024)</ref>, since the dual update is executed via a single-step projection, and the primal optimization retains the character of the unconstrained case without constructing a prior safe subset designed for hard constraints as in <ref type="bibr">(Amani et al., 2020)</ref>. Additionally, the reward and counterfactual fairness constraint in our algorithm share the same unknown parameter &#952; * .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Regret and constraint violations bounds</head><p>In this section, we provide the theoretic upper bounds for both regret and constraint violations of Algorithm 1 and explain the main idea behind the proof of Theorem 1.</p><p>Theorem 1. Suppose &#961; &#8805; 2/&#948;, and &#951; = &#8730; T /&#961;. For 0 &#8804; &#964; &lt; 1, under the Slater's constraint qualification in Assumption 3 and regularity assumptions in Assumption 1 and 2, the CCLB algorithm achieves the following Algorithm 1 CCLB Algorithm 1: Input: Horizon T , truncated parameter &#961;, step size &#951; = &#8730; T /&#961;, and the initial dual value &#981; 1 = 0. 2: for t = 1, 2, 3, . . . , T do 3:</p><p>Use MLE to estimate the reward parameter and build a confidence set C t (&#945;) from Equation (10),</p><p>Greedy procedure. Choose the optimistic reward parameter &#952;t and select the greedy action d t :</p><p>Update the estimates of the dual variable:</p><p>Update the estimation and confidence set according to the new received reward y t+1 . 7: end for bounds simultaneously with probability at least 1 -&#945; for any &#945; &#8712; (0, 1]:</p><p>Remark 2. We remark that: (1) the leading term of our regret O(&#961;nS &#8730; T ) is similar to the bound O(nS T /&#954; * ) established in <ref type="bibr">(Lee et al., 2024)</ref> as the logarithmic growth of T , which improves upon <ref type="bibr">(Abeille et al., 2021)</ref> (OFULog-r) by a factor of S 3/2 and improves upon <ref type="bibr">(Zhang &amp; Sugiyama, 2024)</ref> by at least a factor of &#8730; S. Though it acquires a multiplicative factor &#961;, one could note that, at an extreme case, when the Slater's constant &#948; is optimized to 1/ log(T ), the leading term scales as O(n &#8730; T ).</p><p>(2) Compared to the unconstrained case <ref type="bibr">(Abeille et al., 2021;</ref><ref type="bibr">Zhang &amp; Sugiyama, 2024;</ref><ref type="bibr">Lee et al., 2024)</ref>, the regret bound R + (T ) exhibits an additional term &#961; &#8730; T , which roughly captures the impact of the unknown counterfactual fairness constraint, i.e., a convex combination of logistic functions, which is not logistic function any more. More specifically, the non-convex nature of the logistic mixture introduces a non-linear relationship between the constraint and the reward parameter, thereby resulting in a more complex estimated feasible region of safe decisions at every round. (3) Compared to the constrained generalized linear bandits <ref type="bibr">(Amani et al., 2020)</ref>, our regret bound shows a big improvement on the worst case constant &#954; (see Table <ref type="table">1</ref>). ( <ref type="formula">4</ref>) If &#964; &#8805; 1, the constraint violations bound V(T ) will be zero since the counterfactual fairness constraint is satisfied for all the decisions (see Equation <ref type="formula">5</ref>) and our problem falls into the setting of logistic bandits without constraint.</p><p>See Appendix G for the full proof. We next highlight a few key parts of the proof.</p><p>Proof Sketch of Theorem 1. We first derive the following key decomposition of total regret and constraint violations that holds for any &#981; &#8712; [0, &#961;] : R</p><p>. This is attained by employing a dual variable update and necessary algebraic operations. Note that this bound will serve as the cornerstone for the subsequent analysis of both the regret and constraint violations. 
To further bound R_1 and R_2, we apply the following proposition for the logistic bandits regret.</p><p>Proposition 2. With probability at least 1 − α for any α ∈ (0, 1], under the CCLB algorithm, we have: Σ_{t=1}^{T} ( g(f(Z_t)^⊤ θ̃_t) − g(f(Z_t)^⊤ θ*) ) = Õ( nS√T + n²S²κ_Z ).</p><p>The central idea for obtaining the above regret bound (Proposition 2) is to apply a Taylor expansion that tightly links estimation errors (e.g., between θ̃_t and θ*) to prediction errors (e.g., between g(f(Z_t)^⊤ θ̃_t) and g(f(Z_t)^⊤ θ*)); readers can refer to Appendix F for more technical details. As for the logistic bandits regret based on the counterfactual feature vectors, i.e., Σ_{t=1}^{T} ( g(f(Z^{a′}_t)^⊤ θ̃_t) − g(f(Z^{a′}_t)^⊤ θ*) ), we observe that the counterfactual feature vector Z^{a′}_t and the factual feature vector Z_t both lie in the same feature space Z for our problem. Thus, the counterfactual term exhibits the same asymptotic upper bound, up to logarithmic factors, as the factual one. Therefore, the regret upper bound on R_+(T) is obtained by choosing φ = 0.</p><p>Inspired by <ref type="bibr">(Beck, 2017)</ref>, we apply tools from constrained convex optimization to obtain the bound on the constraint violations V(T). First, we define the probability distribution π′_t that puts probability mass (equal to 1) on the decision d_t chosen by the learner after observing the contextual information at round t. Then, utilizing <ref type="bibr">(Beck, 2017, Theorem 3.60)</ref>, we obtain the upper bound on V(T). ■</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Improved regret and constraint violations bounds</head><p>In Section 3.3, our analysis demonstrates that the proposed CCLB algorithm (Algorithm 1) achieves both sublinear regret and sublinear constraint violations upper bounds. Another natural question to consider is whether the constraint violations bound can be further improved. It turns out that by introducing a tightness parameter &#1013; in the dual update in Algorithm 1, for &#1013; &lt; &#948;,</p><p>one can achieve a bounded and in some cases even zero constraint violations by trading the regret slightly while still preserving the same asymptotic order of regret as before. Intuitively, with a tightness parameter &#1013; &gt; 0 in the constraint, the learner will be more cautious in selecting actions by effectively working with a stricter constraint (e.g. with fairness threshold &#964; -&#1013; instead of &#964; ). Then, under this new hypothetical pessimistic constraint function, the primal problem is modified as:</p><p>be the optimal solution to this new constrained optimization problem, then we have the following relationship between policy &#960; * t,&#1013; and &#960; * t : Proposition 3. Let policies &#960; * t and &#960; * t,&#1013; be the optimal solutions for the constrained problem</p><p>To further investigate how the user-chosen parameter &#1013; will impact the regret and constraint violations upper bounds, we define the regret associated with the policy &#960; * t,&#1013; as: R &#1013;</p><p>We then state the following theoretical results for R &#1013; + (T ) and V(T ): Theorem 2. Suppose &#961; &#8805; 2/&#948;, and &#951; = &#8730; T /&#961;. For 0 &#8804; &#964; &lt; 1 and the user-chosen parameter &#1013; &#8712; [0, &#948;), under Slater's constraint qualification in Assumption 3 and regularity assumptions in Assumption 1 and 2, the CCLB algorithm with refined constraint condition (see Equation ( <ref type="formula">11</ref>)) attains the following theoretical upper bounds with probability at least 1 -&#945; for any &#945; &#8712; (0, 1] :</p><p>Remark 3. One could notice that (1) by introducing a tightness parameter &#1013; in the dual update, the associated regret R &#1013; + (T ) still achieves a comparable asymptotic upper bound as R + (T ) in Theorem 1; nevertheless, the constraint violations V(T ) upper bound exhibits an &#1013;T reduction compared to the result in Theorem 1. Consequently, by selecting &#1013; appropriately, one can offset the other terms through the subtraction of &#1013;T , thereby obtaining a constant upper bound (with respect to the horizon T ) on the constraint violations.</p><p>(2) The difference in regret bounds between Theorem 2 with a user selected &#1013; &#8712; [0, &#948;) and Theorem 1 is &#961; &#8730; T (2&#1013; + &#1013; 2 ). For a problem-dependent (fixed) Slater's constraint qualification constant &#948; &gt; 0, increasing &#1013; only worsens the regret bound, as the learner is increasingly cautious (increasing in &#1013;), selecting from a smaller set of actions than the learner would have with &#1013; = 0. If &#948; is large, &#961; shrinks towards 2 and so for a fixed tightness &#1013; the regret bound reduces. Larger &#948; also allow for a bigger range of &#1013; and thus more room for caution (and regret). Proposition 4. 
By the conditions stated in Theorem 2, define the threshold ϵ′ := (C_1 ρnS√T + C_2 ρn²S²κ_Z + C_3 ρ√T)/T, where C_1, C_2, C_3 are universal constants independent of n, S, T, κ_Z. If n ≥ 2 and ϵ′ &lt; δ for sufficiently large T, then one can achieve a zero upper bound on the constraint violations by selecting ϵ ∈ (ϵ′, δ).</p><p>Note that the user-chosen parameter ϵ trades off between the upper bounds on the regret and the constraint violations <ref type="bibr">(Jenatton et al., 2016)</ref>. Minimizing regret often encourages exploration and adaptability to changing environments, which might lead to occasional violations of the constraint. Conversely, strictly adhering to the constraint may limit the algorithm's ability to adapt, potentially increasing regret.</p></div>
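<p>In code, the only change relative to the CCLB sketch after Algorithm 1 is the tightened dual update of Equation (11) (again an illustrative sketch; eps denotes the user-chosen tightness parameter in [0, δ), with the other names as in the earlier sketch):</p><code><![CDATA[
# Tightened dual update, Eq. (11): the learner acts as if the fairness
# threshold were tau - eps, making the dual variable grow more eagerly
# and so penalizing borderline-unfair actions more heavily.
phi_next = float(np.clip(phi + (abs(delta_t) - tau + eps) / eta, 0.0, rho))
]]></code>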
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Numerical Experiments</head><p>We next evaluate the empirical performance of our proposed methods on a synthetic data set. See Appendix H for additional experiments for different values of the constraint threshold &#964; and tightness parameter &#1013;.</p><p>Data set description:<ref type="foot">foot_3</ref> We generated the synthetic dataset from a structural causal model (modified an example from <ref type="bibr">(Plecko &amp; Bareinboim, 2024</ref>)), i.e.,</p><p>As defined in Figure <ref type="figure">1</ref>, A denotes the sensitive attribute (binary valued), W is the confounded feature, M represents the intermediate feature, D &#8712; D is the agent's decision, and Y is the outcome. At every round, we generate a set of 20 feature vectors {[A, W, M, D i ]} 20 i=1 along with their corresponding counterfactual feature vectors. We use rejection sampling over the sets to make sure that at least twelve of the feature vectors are feasible.</p><p>Algorithms:<ref type="foot">foot_4</ref> We evaluate four different algorithms: GLM-UCB <ref type="bibr">(Filippi et al., 2010)</ref> (unconstrained generalized linear bandits ), OFULog+ <ref type="bibr">(Lee et al., 2024)</ref> (unconstrained logistic bandits ), CCLB (our method, causal logistic bandits with counterfactual fairness constraints, Algorithm 1), and &#1013;-CCLB (our method with a user-chosen tightness parameter &#1013;, Algorithm 2).</p><p>Metrics: We evaluated the algorithms using cumulative regret (6), cumulative constraint violations (7), and a penalized form of cumulative regret for different horizons. For the penalized cumulative regret, when the action picked by the learner violates the counterfactual fairness constraint, the learner still observes the reward value (i.e. the learner can improve the reward parameter estimate &#952;), but we count the reward earned as being 0. In this way, constraint violations are allowed but are not (directly) profitable. This penalized form combines the two primary metrics for simpler analysis. penalized cumulative regret (Figure 2 (c))</p><p>, where rewards are only received for fair actions, there is a large gap between our method (CCLB) and the methods of OFULog+ and GLM-UCB, with the gap growing larger for longer horizons. This is expected since GLM-UCB and OFULog+ do not account for constraints. Both baselines have nearly linear penalized cumulative regret across horizons used. OFU-Log+ does because it frequently violates constraints, and thus large cumulative constraint violations, despite learning good actions (for the unconstrained problem). For cumulative regret (unpenalized), OFULog+ performs better than our method (which seeks to satisfy the constraint).</p><p>GLM-UCB performs poorly at identifying good actions within the horizons (Figure <ref type="figure">2</ref> (a)). GLM-UCB's regret bound has a linear dependence on the &#954; (see Table <ref type="table">1</ref>). GLM-UCB is also designed for a more general class of reward functions. Though &#1013;-CCLB has a larger regret than CCLB (but less than GLM-UCB), the cumulative constraint violations of &#1013;-CCLB are much smaller than CCLB, especially, its growth rate is nearly 0 from horizon T = 2, 000 to horizon T = 10, 000, which rarely violates the constraints.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>This paper introduced a framework for logistic bandits with counterfactual fairness constraints built within a causal structure. The proposed approach attains satisfactory results, demonstrating sublinear growth in both regret and constraint violations by effectively balancing exploration and exploitation within the environment via primal-dual optimization. By introducing a user-chosen parameter, one can trade the upper bounds between regret and constraint violations to achieve zero cumulative constraint violations.</p><p>Several promising directions emerge for future research. (1) One important direction is to extend our method to work with unobserved confounders (i.e. W would be unobserved).</p><p>(2) Another interesting direction is to extend our model to handle distribution shifts over time.</p><p>(3) A third interesting direction would be to extend our work to handle budget constraints and consider a fairness notion defined by the resource assignment, potentially building on existing work in bandits with knapsacks <ref type="bibr">(Tran-Thanh et al., 2012;</ref><ref type="bibr">Badanidiyuru et al., 2018;</ref><ref type="bibr">Nie et al., 2024)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Organization of the Appendix</head><p>&#8226; In Appendix A, we recall important notations and introduce some useful functions and results.</p><p>&#8226; In Appendix B, we provide additional related works for our problem.</p><p>&#8226; In Appendix C, we list some technical lemmas, needed for the analysis.</p><p>&#8226; In Appendix D, we provide motivation and proofs for the constrained causal logistic bandits framework.</p><p>&#8226; In Appendix E, we prove the convex confidence set.</p><p>&#8226; In Appendix F, we prove the logistic bandits regret upper bound.</p><p>&#8226; In Appendix G, we prove the total regret and constraint violations upper bounds, and the improved results.</p><p>&#8226; In Appendix H, we provide additional results for the numerical experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Preliminaries</head><p>We first provide a formal definition for d-separation discussed in Section 1.3, Definition 2 (d-separation <ref type="bibr">(Pearl, 2009)</ref>). A path p is said to be d-separated (or blocked) by a set of nodes Z if and only if (1) p contains a chain i &#8594; m &#8594; j or a fork i &#8592; m &#8594; j such that the middle node m is in Z, (2) p contains an intervened fork (or collider) i &#8594; m &#8592; j such that the middle node m is not in Z.</p><p>We then detail below some useful notations that have been used throughout the paper. Below</p><p>1}, &#952; * true reward parameter vector. Y reward variable. X t context vector including the specified attribute, the confounded features, and the intermediate features. f (Z t ) mapping feature vector. &#955; t regularization parameter. &#981; t dual variable. &#961; truncated parameter. &#1013; user-chosen tightness parameter. &#948; Slater's constant. &#945; failure probability. C t (&#945;) confidence set. B n p (1) n-dimensional ball of radius 1 under the &#8467; p norm. || &#8226; || &#8467; 2 norm.</p><p>We further recall and introduce the following functions and use it for the following analysis,</p><p>Where the regularized design matrices V t , H t (&#952; * ), G t (&#952;, &#952; * ), and &#945;(f (Z &#964; ), &#952;t , &#952; * ) are defined for the proof of logistic bandits regret upper bound in Appendix F. In particular, H t (&#952; * ) measures the local behavior of the logistic function through &#289; f (Z &#964; ) &#8890; &#952; * .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Additional Related Works</head><p>Logistic bandits. The logistic bandits model represents a sequential decision-making framework that has attracted substantial attention within the parametric bandits literature <ref type="bibr">(Li et al., 2010;</ref><ref type="bibr">Filippi et al., 2010;</ref><ref type="bibr">Li et al., 2017;</ref><ref type="bibr">Dong et al., 2019)</ref>. In a recent work, <ref type="bibr">Faury et al. (2020)</ref> proposed an optimistic algorithm based on a finer examination of the non-linearities of the reward function to study the prohibitive linear dependencies introduced by &#954; in the regret upper bound. <ref type="bibr">Abeille et al. (2021)</ref> proved a minimax-optimal rate by deriving an &#8486; n T /&#954; * (T ) problem-dependent lower-bound, which implies that the non-linearity in logistic bandits can ease the exploration-exploitation trade-off in the long-term regime, i.e. &#954; * (T ) &gt; 1. <ref type="bibr">Faury et al. (2022)</ref> addressed the issue of computational tractability while preserving statistical efficiency by designing a new convex confidence set. Additionally, another line of research is the multinomial logit contextual bandit problem <ref type="bibr">(Agrawal et al., 2017;</ref><ref type="bibr">Oh &amp; Iyengar, 2019;</ref><ref type="bibr">Zhang &amp; Sugiyama, 2024;</ref><ref type="bibr">Lee et al., 2024)</ref>, which generalizes the binary logistic bandit by allowing the learner to select a subset of arms. In particular, <ref type="bibr">(Zhang &amp; Sugiyama, 2024;</ref><ref type="bibr">Lee et al., 2024)</ref> also improve the logistic bandits on the regret guarantee (with respect to S) and computational complexity, respectively.</p><p>Fairness. The body of research in fair machine learning is expanding and encompasses a variety of contexts. Within this field, three distinct tasks can be identified: (1) the detection and quantification of biases in currently deployed policies; (2) the development of fair predictive models for outcomes; and (3) the formulation of fair decision-making policies. Our work falls under the setting of online outcome control (task (3)) that explores fairness through a causal lens <ref type="bibr">(Huang et al., 2022a;</ref><ref type="bibr">b;</ref><ref type="bibr">Plecko &amp; Bareinboim, 2023;</ref><ref type="bibr">2025)</ref>. Unlike us, <ref type="bibr">Plecko &amp; Bareinboim (2023;</ref><ref type="bibr">2025)</ref>  Constrained MABs. There is a large body of work on bandits with different types of constraints, including knapsack bandits <ref type="bibr">(Wu et al., 2015;</ref><ref type="bibr">Agrawal &amp; Devanur, 2016)</ref>, submodular maximization <ref type="bibr">(Krause &amp; Guestrin, 2007;</ref><ref type="bibr">Nie et al., 2023)</ref>, bandits with hard safety constraints <ref type="bibr">(Amani et al., 2019;</ref><ref type="bibr">Pacchiano et al., 2021)</ref>, and bandits with cumulative soft constraints <ref type="bibr">(Liu et al., 2021;</ref><ref type="bibr">Zhou &amp; Ji, 2022)</ref>. Among them, the bandit setting with cumulative soft constraints is most closely related to ours in that the goal is also to minimize the cumulative constraint violation. In particular, Zhou &amp; Ji (2022) considered a general unknown reward function and a general unknown constraint function in kernelized bandits via primal-dual optimization. 
More broadly, this type of constrained problem has also been studied in the reinforcement learning (RL) setting <ref type="bibr">(Efroni et al., 2020;</ref><ref type="bibr">Ding et al., 2021)</ref> where constraints are managed through convex optimization methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Technical Lemmas</head><p>Lemma 1 ((Abeille et al., 2021) Lemma 11). Let {u &#964; } &#8734; &#964; =1 be a sequence in R n such that ||u &#964; || &#8804; B for all &#964; &#8712; N, and let &#955; be a non-negative scalar. For t &#8805; 1 define</p><p>The following inequality holds:</p><p>Proof. By definition of V t :</p><p>Where the second inequality follows by &#955; t &#8805; &#955; t-1 ; the forth inequality comes from Matrix Determinant Lemma. Taking the log on both side of the equation and summing from t = 1 to T :</p><p>Where the second equality is by telescopic sum; and the last equality comes from Lemma 1. Therefore:</p><p>Where the second inequality comes from ||u t || 2 V -1 t &#8804; B 2 /&#955; t ; and the third inequality follows by log(1</p><p>We then state some useful generalized self-concordance results from <ref type="bibr">(Faury et al., 2020, Lemma 9)</ref> and <ref type="bibr">(Abeille et al., 2021, Lemma 7)</ref>. We provide a proof for the sake of completeness (we also use the properties from <ref type="bibr">(Abeille et al., 2021, Lemma 8)</ref>).</p><p>Lemma 3 ((Faury et al., 2020) Lemma 9). Let g be a strictly increasing function such that |g| &#8804; | &#289;|, and let Z be any bounded interval of R. Then, for all z 1 , z 2 &#8712; Z:</p><p>Proof. Since function g is strictly increasing, we have &#289; &gt; 0 for any z &#8712; Z. Therefore:</p><p>where the first line follows from z 0 &#8712; Z. Assume that z 2 &#8805; z 1 , let v &#8805; 0, and set z 0 = z 1 + v(z 2 -z 1 ), we then could easily get:</p><p>Where the second line follows by taking integral of v from 0 to 1 for both sides; and the last line is obtained by using exp(-x) &#8804; (1 + x) -1 if x &#8805; 0. We note that the same inequality can be proved when z 2 &lt; z 1 by following the same steps. &#9632; Lemma 4 (Polynomial Inequality; <ref type="bibr">(Abeille et al., 2021)</ref> Lemma 7). Let b, c &#8712; R + , and u &#8712; R. The following implication holds:</p><p>Then f is a strongly-convex function which roots are:</p><p>If u 2 &#8804; bu + c, then f (u) &lt; 0 and by convexity of f we obtain:</p><p>Where the last inequality is because &#8730; x + y &#8804; &#8730; x + &#8730; y, &#8704;x, y &#8805; 0. &#9632;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Causal Logistic Bandits Framework</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.1. Additional motivation example</head><p>Online Recommendation System <ref type="bibr">(Huang et al., 2022b)</ref>. Customers arrive sequentially according to an underlying stochastic distribution, and an online decision-making model selects and recommends a specific item to each incoming individual based on a predefined strategy. In this context, each arm represents a distinct item or content piece available for recommendation to a user. The reward is determined by the user's interaction with the recommended item, such as whether the user clicks on it or not. The fairness constraint mandates that customers with similar profiles receive similar rewards, irrespective of their specific attributes and the particular items being recommended.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.2. Derivation for the factual and counterfactual expected reward</head><p>Here, we provide proofs for Equation (2) and Equation ( <ref type="formula">3</ref>), which follow by the do-calculus rule <ref type="bibr">(Pearl, 1995)</ref>.</p><p>where ( <ref type="formula">17</ref>) follows by (D, A &#8869; &#8869; Y |W, M ) G D,A (see Figure <ref type="figure">3b</ref>); (18) follows by denoting Z t as the features from d t , w t , m t and a t ; and (19) follows by the logistic reward assumption (Equation ( <ref type="formula">1</ref>)). As for Equation ( <ref type="formula">3</ref>),</p><p>where ( <ref type="formula">20</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Confidence Sets</head><p>In this section, we provide proofs for the construction of the improved convex confidence set for the estimated bandit parameter presented in Section 3.1. We borrow the techniques from <ref type="bibr">(Lee et al., 2024, Section 3)</ref> to obtain the results.</p><p>Recall the convex confidence set definition:</p><p>where:</p><p>Proposition 1. Let &#945; &#8712; (0, 1], then</p><p>Proof. The proof unfolds through three principal technical components similar with <ref type="bibr">(Lee et al., 2024)</ref>. First, we invoke decomposition identities for the logistic loss, expressing L t (&#952;) -L t ( &#952;t ) as the sum of (i) the regret of the online learning algorithm, (ii) a martingale difference sequence, and (iii) a collection of KL-divergence terms. Second, in controlling the martingale sum, we derive and apply an anytime variant of Freedman's inequality tailored to martingales. Third, to bound the KL-divergence contribution, we fuse the self-concordant analysis of Abeille et al. ( <ref type="formula">2021</ref>) with an information-geometric interpretation of the KL divergence.</p><p>Firstly, we denote &#958; &#964; as a real-valued martingale difference noise where &#958; &#964; = g(f (Z &#964; ) &#8890; &#952; * ) -y &#964; , thus for the logistic loss</p><p>, we have that the following equality holds for any &#952;:</p><p>The equality follows from the first order Taylor expansion with an integral remainder (see <ref type="bibr">(Lee et al., 2024, Appendix C.4</ref>.1) for more details). Setting &#952; to be the optimistic estimate &#952;&#964; and taking a sum over time steps &#964; :</p><p>where in ( <ref type="formula">23</ref>) we add and subtract &#8467; &#964; ( &#952;t ) and in ( <ref type="formula">24</ref>) we rearrange terms. We further define and <ref type="table">&#950; 3</ref>  <ref type="formula">24</ref>), we then have</p><p>)} is the filtration for our bandit model, f (Z &#964; ) and &#952;&#964; are F s-1 -measurable, and &#958; &#964; is a martingale difference sequence w.r.t. F s-1 . Thus, we have that,</p><p>From <ref type="bibr">(Beygelzimer et al., 2011</ref>, Theorem 1), we could apply Freedman's inequality to obtain the following result, Lemma 5. <ref type="bibr">(Lee et al., 2024, Lemma 3)</ref>. Let f (Z 1 ), ..., f (Z t ) be martingale difference sequence satisfying max &#964; |f (Z &#964; )| &#8804; R a.s., and let F &#964; be the &#963;-field generated by (f (Z 1 ), ..., f (Z t ). Then for any &#945; &#8712; (0, 1) and any &#951; &#8712; [0, 1/R], the following holds with probability at least 1 -&#945;:</p><p>Thus, for &#951; &#8712; [0, 1 2S ] to be chosen later, by invoking Lemma 5 for the martingale difference sequence f (Z 1 ), ..., f (Z t ), the following holds with probability at least 1 -&#945;, &#8704;t &#8805; 1:</p><p>Lower Bounding &#950; 2 (t). From the standard result in information geometry <ref type="bibr">(Amari, 2016;</ref><ref type="bibr">Brekelmans et al., 2020)</ref>, we have the following result: Lemma 6. <ref type="bibr">(Lee et al., 2024, Lemma 4)</ref>. Let m(z) := log(1 + e z ) be the log-partition function for the Bernoulli distribution and g(z) = 1 1+e -z . 
Then, we have that</p><p>where</p><p>Notice that</p><p>Thus, we have the following lower bound on ζ_2(t),</p><p>where (<ref type="formula">28</ref>) follows by the definition of ζ_2; (29) uses Lemma 6; (30) uses (<ref type="formula">27</ref>); (31) follows by a change of variables; (32) follows by <ref type="bibr">(Abeille et al., 2021, Lemma 8)</ref>; and (33) follows by Assumptions 1 and 2.</p><p>Upper Bounding ζ_3(t) <ref type="bibr">(Lee et al., 2024, Theorem 2)</ref>. From <ref type="bibr">(Foster et al., 2018, Theorem 3)</ref>, there exists an (improper learning) algorithm for online logistic regression with the following regret:</p><p>Though our selected decisions are more conservative than those of <ref type="bibr">(Lee et al., 2024)</ref> (we add a penalty term when selecting decisions to account for constraint violations), the estimation method used to obtain θ̂_t (i.e., MLE) and the way we compute the optimistic estimate θ̃_t (i.e., θ̃_t = argmax_{θ∈C_t} max_{d∈D_t} E[Y | do(d), a_t, w_t, m_t]) are the same as in <ref type="bibr">(Lee et al., 2024)</ref>. See the justification for using the improper learning algorithm in <ref type="bibr">(Lee et al., 2024, Appendix B.2)</ref>.</p><p>Combining Equations (<ref type="formula">25</ref>), (<ref type="formula">26</ref>), (33), and (34) with an appropriate choice of η ∈ [0, 1/(2S)] yields the claimed bound, which finishes the proof. ■</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F. Logistic Regret Upper Bounds</head><p>In this section, we provide the proofs for logistic bandits regret upper bounds presented in Section 3.3. Some of the details follow from <ref type="bibr">(Faury et al., 2020, Appendix B)</ref> and <ref type="bibr">(Abeille et al., 2021, Appendix C)</ref>. We first define the regret of logistic bandits, and use which to prove the regret and constraint violations upper bound in Appendix G for our problem:</p><p>where the (35) comes from the expected reward in (2); the (36) is by performing the Taylor Series Expansion of g(f (Z t ) &#8890; &#952;t ) on f (Z t ) &#8890; &#952; * with a first order integral remainder. Then we rewrite the logistic regret R log as R log 1 and R log 2 , where,</p><p>We separately upper bound both terms. Firstly, we prove the following Lemma used throughout this section.</p><p>Lemma 7. With &#955; t = 1 4S 2 (2+2S) , for any &#952; &#8712; C t (&#945;), the following holds with probability at least 1 -&#945;:</p><p>Proof. By proposition 1, we have that with probability at least 1 -&#945;, L t (&#952; * ) -L t ( &#952;t ) &#8804; &#946; t (&#945;) 2 , we assume this event is true throughout this proof. Then, by second-order Taylor expansion of L t (&#952;) around &#952; * ,</p><p>As for the relationship between G t (&#952;, &#952; * ) and H t (&#952; * ), we have the following result:</p><p>Thus, we have that,</p><p>Where (37) follows by</p><p>, by Freedman's inequality (26), for any &#951; &#8712; [0, 1 2BS ], the following holds:</p><p>By <ref type="bibr">(Vershynin,</ref><ref type="bibr">Corollary 4.2.13</ref>) and <ref type="bibr">(Lee et al., 2024, Appendix C.4.4</ref>) the following holds with probability at least 1 -&#945;:</p><p>Choose &#951; = 1 2(e-2)(2+2S) &lt; 1 2S , &#1013; t = n t , and with Equation (37), we finally have:</p><p>Which finishes the proof. &#9632; F.1. The regret upper bound of R log 1 .</p><p>We start by examining R log 1 and show the following upper bounds:</p><p>Where ( <ref type="formula">39</ref>) and ( <ref type="formula">42</ref>) is by Cauchy-Schwarz inequality ( &#289; f (Z t ) &#8890; &#952; * is non-negative); (40) comes from Lemma 7 and ( <ref type="formula">41</ref>) is because &#947; T (&#945;) = max t&#8712;[T ] &#947; t (&#945;); in (43), we define vector</p><p>, and obtain:</p><p>and (44) follows by Lemma 2.</p><p>We then take a look at the first order of the logistic function &#289;(f (Z t ) &#8890; &#952; * ) and derive a upper bound for it by a first-order Taylor expansion:</p><p>Where ( <ref type="formula">45</ref>) comes from the Taylor Expansion; (46) follows by changing variables; (47) is by taking the absolute value; ( <ref type="formula">48</ref>) is because the optimistic estimate at step t, hence, f <ref type="formula">50</ref>) follows by the self-concordance property of logistic function | &#289;| &#8805; |g| and &#289; &gt; 0; (51) follows by ( <ref type="formula">16</ref>) and ( <ref type="formula">52</ref>) is from the fundamental theorem of calculus; and (54)</p><p>Therefore, by ( <ref type="formula">44</ref>) and ( <ref type="formula">54</ref>), we intermediately obtain the following upper bound on R log 1 :</p><p>where ( <ref type="formula">55</ref>) is because</p><p>F.2. 
<p>F.2. The regret upper bound of R^log_2.</p><p>To upper bound the logistic bandit regret R^log in (36), it remains to upper bound R^log_2, which involves the second derivative of the logistic function:</p><p>where (57) follows by a change of variables; (58) is because g ≤ 1; (59) follows by the Cauchy-Schwarz inequality; (60) comes from Lemma 7; (61) is because γ_T(α) = max_{t∈[T]} γ_t(α); as for (62), we note that</p><p>where the second line comes from Definition 1; and (63) follows by Lemma 2.</p><p>Then, by the upper bound on R^log_1 in Equation (55) and on R^log_2 in Equation (63), we finally upper bound the logistic bandit regret R^log in Equation (36):</p><p>where (66) follows by Lemma 4, and (67) comes from (x + y)² ≤ 2x² + 2y² (written out after this proof). To further simplify the logistic bandit regret R^log, we write γ_t(α) as:</p><p>To build intuition for how γ_t(α) behaves as t grows, we express γ_T(α) in asymptotic notation in T, i.e., γ_T(α) = O(Sn log(T)). Therefore, for R^log, we obtain the following bounds:</p><p>which finishes the proof. ■</p></div>
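<p>For completeness, the elementary inequality used at (67) can be written out in one line (a standard step, included here only as a worked detail):</p><p><![CDATA[
% Step used at (67): for any real x, y,
\[
(x + y)^2 \;=\; x^2 + 2xy + y^2 \;\le\; x^2 + (x^2 + y^2) + y^2 \;=\; 2x^2 + 2y^2,
\]
% since 2xy \le x^2 + y^2, which is just (x - y)^2 \ge 0 rearranged.
]]></p>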
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G. Regret guarantee and constraint violations</head><p>In this section, we provide proofs for upper bounds of both reward regret and constraint violations. Our proofs build on the greedy procedure in Algorithm 1 and standard convex optimization analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G.1. Proof of Theorem 1</head><p>We first prove the regret upper bound. Under Slater's constraint qualification in Assumption 3, we have the boundedness of the optimal dual solution by standard convex optimization analysis from <ref type="bibr">(Beck, 2017, Theorem 8.42)</ref>, where,</p><p>the r.h.s. is because the logistic function is less than 1. Now, we turn to establish a bound over R + (T ) + &#981;V(T ). First, note that,</p><p>Where (69) holds since &#981; t &#8805; 0 and E &#960; * t (|&#8710;(d t )| -&#964; ) &#8804; 0; (71) holds by adding and subtracting</p><p>; and (72) comes from Lemma 8. We are then going to bound R 1 :</p><p>Where the second equality follows by adding and subtracting terms; the third inequality comes from the greedy action d t chosen at step t in Algorithm 1; the forth inequality comes from the optimistic estimate &#952;t in Algorithm 1; and the fifth inequality is because &#981; t &#8804; &#961;; as for the result in the last line, we note that if &#8710;(d) &#8805; 0 then &#8710;(d) &#8805; 0, and by the counterfactual fairness effect (see Equation ( <ref type="formula">4</ref>)), we have,</p><p>where the second line follows by optimistic estimation. On the another hand, if</p><p>, similarly, the second line follows by the optimistic estimate, here, we notice that, the factual feature Z t and the counterfactual feature Z a &#8242; t reside within the feature space Z, in which the boundness assumption (Assumption 1) and problem dependent constant (Definition 1) are defined by. Therefore, the logistic bandits regret of g(f (Z t ) &#952;t ) -g(f (Z t )&#952; * ) has the same asymptotic upper bound up to logarithmic factors as g(f (Z a &#8242; t ) &#952;t ) -g(f (Z a &#8242; t )&#952; * ) (see Equation ( <ref type="formula">68</ref>)). We further bound R 2 . When &#8710;(d) &#8805; 0:</p><p>Where the second equality follows by the counterfactual fairness effect (see Equation ( <ref type="formula">4</ref>)); the forth inequality follows by the optimistic estimate.</p><p>When &#8710;(d) &lt; 0, we have that:</p><p>Since the logistic bandits regret of g(f (Z t ) &#952;t ) -g(f (Z t )&#952; * ) has the same asymptotic upper bound up to logarithmic factors as g(f</p><p>Thus, by Equation ( <ref type="formula">73</ref>) and Equation ( <ref type="formula">74</ref>), the upper bound on R + (T ) + &#981;V(T ) for any &#981; &#8712; [0, &#961;] is the following:</p><p>Regret R + (T ). By setting &#981; = 0, then we obtain the upper bounds on the total regret guarantee with high probability:</p><p>where the second line is because the truncated parameter &#961; &#8805; 2/&#948;; and the last line is to write the regret upper bound in a logarithmic asymptotic notation.</p><p>Constraint violations. Next, to obtain a bound on V(T ), we employ tools from constrained convex optimization. First, we define probability distribution &#960; &#8242; t by</p><p>Then we apply the following theorem from Theorem 3.60 in <ref type="bibr">(Beck, 2017)</ref>.</p><p>Theorem 3. Consider the following convex constrained problem f (&#960; * ) = max &#960;&#8712;&#928; {f (&#960;) : g(&#960;) &#8804; 0}, where both f and g are convex over the convex set &#928; in a vector space. 
Suppose f(π*) is finite, there exists a Slater point π₀ such that g(π₀) ≤ −δ, and there is a constant ρ ≥ 2ϕ*, where ϕ* is the optimal dual variable, i.e., ϕ* = argmin_{ϕ≥0} (max_π f(π) − ϕg(π)).</p><p>If π′ ∈ Π satisfies</p><p>for some ϵ &gt; 0, then [g(π′)]₊ ≤ 2ϵ/ρ.</p></div>
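<p>As a toy numerical illustration of Theorem 3 (ours, not from <ref type="bibr">(Beck, 2017)</ref>), consider a one-dimensional instance; we assume here that the hypothesis elided above is the standard exact-penalty condition f(π*) − f(π′) + ρ[g(π′)]₊ ≤ ϵ, stated for maximization in the theorem and flipped to minimization in this sketch.</p><p><![CDATA[
import numpy as np

# Toy instance: minimize f(x) = x over [-1, 1] subject to g(x) = 0.5 - x <= 0.
# Optimum: x* = 0.5 with f* = 0.5; Slater point x0 = 1 gives g(x0) = -0.5;
# the optimal dual variable is phi* = 1, so rho = 2 satisfies rho >= 2*phi*.
f = lambda x: x
g = lambda x: 0.5 - x
f_star, rho = 0.5, 2.0

for x in np.linspace(-1.0, 1.0, 2001):
    eps = f(x) - f_star + rho * max(g(x), 0.0)  # assumed penalty-gap condition
    # Conclusion of Theorem 3: [g(x)]_+ <= 2*eps/rho.
    assert max(g(x), 0.0) <= 2.0 * eps / rho + 1e-12
print("Theorem 3 conclusion holds on the toy instance")
]]></p>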
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Since</head><p>Then Equation ( <ref type="formula">76</ref>) satisfies the conditions in Theorem 3 and we have that:</p><p>Where the second line follows by &#981; &#8712; [0, &#961;] and 1/&#961; &lt; 1. &#9632; Lemma 8. Under the dual update of &#981; t in Algorithm 1, we have the following for any &#981; &#8712; [0, &#961;]:</p><p>Proof. By the dual update of &#981; t in Algorithm 1:</p><p>Summing over T steps and multiplying both sides by &#951; 2 :</p><p>Therefore:</p><p>where the second equality comes from telescopic sum; and the forth equality follows by &#981; 1 = 0 and &#951; = &#8730; T /&#961; initialized in Algorithm 1; and the fifth inequality is because</p><p>Proposition 3 states the relationship between policy &#960; * t and &#960; * t,&#1013; for the regret upper bounds, we provide a proof in the following. Proposition 3. Let policies &#960; * t and &#960; * t,&#1013; be the optimal solution for constrained problem</p><p>Proof. The policies &#960; * t and &#960; * t,&#1013; are defined as:</p><p>Let one policy &#960; t,&#1013; = (1 -&#1013; &#948; )&#960; * t + &#1013; &#948; &#960; t,0 , where &#960; t,0 is the policy satisfies the Slater's constrained qualification, i.e.,</p><p>Therefore, &#960; t,&#1013; is a feasible solution of the baseline problem</p><p>Where the first inequality follows by that &#960; * t,&#1013; is the optimal solution while &#960; t,&#1013; is a feasible solution to E &#960;t,&#1013; <ref type="bibr">[E[Y |do(d)</ref>, do(a t ), w t , m t ]] : In this section, we establish upper bounds on both regret and constraint violation for the revised constraint condition (see Algorithm 2). This is achieved by introducing a slackness variable &#1013;, which serves to tighten the constraint. First, we decompose R &#1013; + (T ) + &#981;V &#1013; (T ) as follows, where V &#1013; (T ) = Causal Logistic Bandits with Counterfactual Fairness Constraint = T t=1 E &#960; * t,&#1013; E[Y |do(d), do(a t ), w t , m t ] -E[Y |do(d t ), do(a t ), w t , m t ] + &#981;[|&#8710;(d t )| -&#964; + &#1013;] &#8804; T t=1 E &#960; * t,&#1013; E[Y |do(d), do(a t ), w t , m t ] -E[Y |do(d t ), do(a t ), w t , m t ] + &#981;[|&#8710;(d t )| -&#964; + &#1013;] -&#981; t E &#960; * t,&#1013; [|&#8710;(d)| -&#964; + &#1013;] = T t=1 E &#960; * t,&#1013; E[Y |do(d), do(a t ), w t , m t ] -&#981; t E &#960; * t,&#1013; [|&#8710;(d)| -&#964; + &#1013;] -E[ Y |do(d t ), do(a t ), w t , m t ] + &#981; t [| &#8710;(d t )| -&#964; + &#1013;] R &#1013; 1 + T t=1 E[ Y |do(d t ), do(a t ), w t , m t ] -E[Y |do(d t ), do(a t ), w t , m t ] + &#981; [|&#8710;(d t )| -&#964; + &#1013;] -[| &#8710;(d t )| -&#964; + &#1013;] R &#1013; 2 + T t=1 &#981;[| &#8710;(d t )| -&#964; + &#1013;] -&#981; t [| &#8710;(d t )| -&#964; + &#1013;] . R &#1013; 3 Similar to the techniques in Section G.1, we can upper bound R &#1013; 1 , R &#1013; 2 , R &#1013; 3 as follows: R &#1013; 1 = T t=1 E &#960; * t,&#1013; E[Y |do(d), do(a t ), w t , m t ] -&#981; t E &#960; * t,&#1013; [|&#8710;(d)| -&#964; + &#1013;] -E[ Y |do(d t ), do(a t ), w t , m t ] + &#981; t [| &#8710;(d t )| -&#964; + &#1013;] = &#961; &#8226; O nS log(T ) &#8730; T + n 2 S 2 (log(T )) 2 + n 2 S 2 &#954; Z (log(T )) 2 , R &#1013; 2 = T t=1 E[ Y |do(d t ), do(a t ), w t , m t ] -E[Y |do(d t ), do(a t ), w t , m t ] + &#981; [|&#8710;(d t Regret R &#1013; + (T ). By setting &#981; = 0, we have:</p><p>Constraint violations. 
<p>Constraint violations. By applying <ref type="bibr">(Beck, 2017, Theorem 3.60)</ref>, we have:</p><p>To obtain a bound on V(T), we note that:</p><p>which finishes the proof. ■</p><p>Algorithm 2 ϵ-CCLB Algorithm</p><p>1: Input: horizon T, truncation interval ρ, step size η = √T/ρ, initial dual value ϕ₁ = 0, and user-selected parameter ϵ ∈ [0, δ).</p><p>2: for t = 1, 2, 3, . . . , T do</p><p>3: Use MLE to estimate the reward parameter and build the confidence set C_t(α) from Equation (10), C_t(α) = {θ ∈ Θ : L_t(θ) − L_t(θ̂_t) ≤ β_t(α)²}.</p><p>To achieve zero cumulative constraint violations, we solve for ϵ such that the right-hand side is less than 0:</p><p>where Γ = C₁nS log(T)√T + C₂n²S²(log(T))² + C₃κ_Z n²S²(log(T))² + C₄√T. First, when T is large, the upper bound on ϵ satisfies the following inequality:</p><p>Since this upper bound is larger than the Slater constant, the requirement ϵ &lt; δ remains the binding one. Now we examine the lower bound on ϵ: if the lower bound ϵ′ is less than the Slater constant, then whenever the learner chooses ϵ ∈ [ϵ′, δ), we achieve zero cumulative constraint violations (a worked sketch of this calculation follows the proof). ■</p></div>
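<p>As a worked sketch of the zero-violation calculation above, assume (consistently with the surrounding text, though the exact display is unavailable) that the violation bound takes the form V^ϵ(T) ≤ (2/ρ)(Γ − ϵT)₊. Then:</p><p><![CDATA[
% Assumed form of the bound: V^eps(T) <= (2/rho) * (Gamma - eps*T)_+ .
% Solving for the right-hand side to be non-positive:
\[
\Gamma - \epsilon T \le 0
\quad\Longleftrightarrow\quad
\epsilon \;\ge\; \epsilon' := \frac{\Gamma}{T}
\;=\; O\!\Big(\frac{nS\log(T)}{\sqrt{T}}\Big),
\]
% so eps' -> 0 as T grows; once T is large enough that eps' < delta,
% any choice eps in [eps', delta) yields zero cumulative violations.
]]></p>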
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H. Additional Experiments</head><p>In this section, we provide additional evaluations for the &#1013;-CCLB and CCLB algorithms when selecting different tightness parameter &#1013; (Figure <ref type="figure">4</ref>) and counterfactual fairness threshold &#964; (Figure <ref type="figure">5</ref>), respectively.    </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>When the variable being intervened on is clear from context, we write do(c) for short notation.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>We will drop the dependency on &#952; * when there is no ambiguity.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>E[ Y |do(d), Xt] is the estimated expected reward for decision d and context Xt.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3"><p>The source code is available at https://github.com/ jchen-research/CCLB.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4"><p>Another potential baseline is(Huang et al., 2022b), which also studied counterfactual fairness in the causal bandits framework, though for a different causal graph. Their code was not available at the time of this work.</p></note>
		</body>
		</text>
</TEI>
