Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

We study modelfree reinforcement learning (RL) algorithms for infinitehorizon averagereward Markov decision process (MDP), which is more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of modelfree RL algorithms is relatively inadequate for the averagereward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient modelfree algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCBAVG, based on an optimistic variant of variancereduced Qlearning. We show that UCBAVG achieves a regret bound $\widetilde{O}(S^5A^2sp(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of stateaction space, and $sp(h^*)$ the span of the optimal bias function. Our result provides the first computationally efficient modelfree algorithm that achieves the optimal dependence in $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results either are suboptimal in $T$ or require strong assumptions of ergodicity or uniformly mixing of MDPs. In the simulator setting, we adapt the idea of UCBAVG to develop a modelfree algorithm that finds an $\epsilon$optimal policy with sample complexity $\widetilde{O}(SAsp^2(h^*)\epsilon^{2} + S^2Asp(h^*)\epsilon^{1}).$ This sample complexity is nearoptimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SAsp(^*)\epsilon^{2})$. Existing work mainly focuses on ergodic MDPs and the results typically depend on $t_{mix},$ the worstcase mixing time induced by a policy. We remark that the diameter $D$ and mixing time $t_{mix}$ are both lower bounded by $sp(h^*)$, and $t_{mix}$ can be arbitrarily large for certain MDPs. On the technical side, our approach integrates two key ideas: learning an $\gamma$discounted MDP as an approximation, and leveraging referenceadvantage decomposition for variance in optimistic Qlearning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of valuedifference between state pairs to provide a sharper bound on the variance of reference advantage. We also crucially use a careful choice of the discounted factor $\gamma$ to balance approximation error due to discounting and the statistical learning error, and we are able to maintain a goodquality reference value function with $O(SA)$ space complexity.more » « lessFree, publiclyaccessible full text available July 1, 2024

In this paper, we consider a largescale heterogeneous mobile edge computing system, where each device’s mean computing task arrival rate, mean service rate, mean energy consumption, and mean offloading latency are drawn from different bounded continuous probability distributions to reflect the diverse computeintensive applications, mobile devices with different computing capabilities and battery efficiencies, and different types of wireless access networks (e.g., 4G/5G cellular networks, WiFi). We consider a class of distributed thresholdbased randomized offloading policies and develop a threshold update algorithm based on its computational load, average offloading latency, average energy consumption, and edge server processing time, depending on the server utilization. We show that there always exists a unique MeanField Nash Equilibrium (MFNE) in the largesystem limit when the task processing times of mobile devices follow an exponential distribution. This is achieved by carefully partitioning the space of mean arrival rates to account for the discrete structure of each device’s optimal threshold. Moreover, we show that our proposed threshold update algorithm converges to the MFNE. Finally, we perform simulations to corroborate our theoretical results and demonstrate that our proposed algorithm still performs well in more general setups based on the collected realworld data and outperforms the wellknown probabilistic offloading policy.more » « lessFree, publiclyaccessible full text available July 1, 2024

In offline multiagent reinforcement learning (MARL), agents estimate policies from a given dataset. We study rewardpoisoning attacks in this setting where an exogenous attacker modifies the rewards in the dataset before the agents see the dataset. The attacker wants to guide each agent into a nefarious target policy while minimizing the Lp norm of the reward modification. Unlike attacks on singleagent RL, we show that the attacker can install the target policy as a Markov Perfect Dominant Strategy Equilibrium (MPDSE), which rational agents are guaranteed to follow. This attack can be significantly cheaper than separate singleagent attacks. We show that the attack works on various MARL agents including uncertaintyaware learners, and we exhibit linear programs to efficiently solve the attack problem. We also study the relationship between the structure of the datasets and the minimal attack cost. Our work paves the way for studying defense in offline MARL.more » « lessFree, publiclyaccessible full text available June 27, 2024

Free, publiclyaccessible full text available June 19, 2024

The practicality of reinforcement learning algorithms has been limited due to poor scaling with respect to the problem size, as the sample complexity of learning an εoptimal policy is Ω(SAH/ ε2) over worst case instances of an MDP with state space S, action space A, and horizon H. We consider a class of MDPs for which the associated optimal Q* function is low rank, where the latent features are unknown. While one would hope to achieve linear sample complexity in S and A due to the low rank structure, we show that without imposing further assumptions beyond low rank of Q*, if one is constrained to estimate the Q function using only observations from a subset of entries, there is a worst case instance in which one must incur a sample complexity exponential in the horizon H to learn a near optimal policy. We subsequently show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LRMCPI) and Low Rank Empirical Value Iteration (LREVI) achieve the desired sample complexity of Õ((S+A)poly (d,H)/ε2) for a rank d setting, which is minimax optimal with respect to the scaling of S, A, and ε. In contrast to literature on linear and lowrank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal lowrank structural assumptions required on the MDP with respect to the transition kernel versus the optimal actionvalue function.more » « lessFree, publiclyaccessible full text available May 19, 2024

Free, publiclyaccessible full text available May 1, 2024

Abstract Network interference, where the outcome of an individual is affected by the treatment assignment of those in their social network, is pervasive in realworld settings. However, it poses a challenge to estimating causal effects. We consider the task of estimating the total treatment effect (TTE), or the difference between the average outcomes of the population when everyone is treated versus when no one is, under network interference. Under a Bernoulli randomized design, we provide an unbiased estimator for the TTE when network interference effects are constrained to loworder interactions among neighbors of an individual. We make no assumptions on the graph other than bounded degree, allowing for wellconnected networks that may not be easily clustered. We derive a bound on the variance of our estimator and show in simulated experiments that it performs well compared with standard estimators for the TTE. We also derive a minimax lower bound on the mean squared error of our estimator, which suggests that the difficulty of estimation can be characterized by the degree of interactions in the potential outcomes model. We also prove that our estimator is asymptotically normal under boundedness conditions on the network degree and potential outcomes model. Central to our contribution is a new framework for balancing model flexibility and statistical complexity as captured by this loworder interactions structure.more » « less

We consider the problem of dividing limited resources to individuals arriving over T rounds. Each round has a random number of individuals arrive, and individuals can be characterized by their type (i.e., preferences over the different resources). A standard notion of fairness in this setting is that an allocation simultaneously satisfy envyfreeness and efficiency. The former is an individual guarantee, requiring that each agent prefers the agent’s own allocation over the allocation of any other; in contrast, efficiency is a global property, requiring that the allocations clear the available resources. For divisible resources, when the number of individuals of each type are known up front, the desiderata are simultaneously achievable for a large class of utility functions. However, in an online setting when the number of individuals of each type are only revealed round by round, no policy can guarantee these desiderata simultaneously, and hence, the best one can do is to try and allocate so as to approximately satisfy the two properties. We show that, in the online setting, the two desired properties (envyfreeness and efficiency) are in direct contention in that any algorithm achieving additive counterfactual envyfreeness up to a factor of L T necessarily suffers an efficiency loss of at least [Formula: see text]. We complement this uncertainty principle with a simple algorithm, GuardedHope, which allocates resources based on an adaptive threshold policy and is able to achieve any fairness–efficiency point on this frontier. Our results provide guarantees for fair online resource allocation with high probability for multiple resource and multiple type settings. In simulation results, our algorithm provides allocations close to the optimal fair solution in hindsight, motivating its use in practical applications as the algorithm is able to adapt to any desired fairness efficiency tradeoff. Funding: This work was supported by the National Science Foundation [Grants ECCS1847393, DMS1839346, CCF1948256, and CNS1955997] and the Army Research Laboratory [Grant W911NF1710094]. Supplemental Material: The online appendix is available at https://doi.org/10.1287/opre.2022.2397 .more » « less

Discretizationbased approaches to solving online reinforcement learning problems are studied extensively on applications such as resource allocation and cache management. The two major questions in designing discretizationbased algorithms are how to create the discretization and when to refine it. There are several experimental results investigating heuristic approaches to these questions but little theoretical treatment. In this paper, we provide a unified theoretical analysis of modelfree and modelbased, treebased adaptive hierarchical partitioning methods for online reinforcement learning. We show how our algorithms take advantage of inherent problem structure by providing guarantees that scale with respect to the “zooming” instead of the ambient dimension, an instancedependent quantity measuring the benignness of the optimal [Formula: see text] function. Many applications in computing systems and operations research require algorithms that compete on three facets: low sample complexity, mild storage requirements, and low computational burden for policy evaluation and training. Our algorithms are easily adapted to operating constraints, and our theory provides explicit bounds across each of the three facets. Funding: This work is supported by funding from the National Science Foundation [Grants ECCS1847393, DMS1839346, CCF1948256, and CNS1955997] and the Army Research Laboratory [Grant W911NF1710094]. Supplemental Material: The online appendix is available at https://doi.org/10.1287/opre.2022.2396 .more » « less

Randomized experiments are widely used to estimate the causal effects of a proposed treatment in many areas of science, from medicine and healthcare to the physical and biological sciences, from the social sciences to engineering, and from public policy to the technology industry. Here we consider situations where classical methods for estimating the total treatment effect on a target population are considerably biased due to confounding network effects, i.e., the fact that the treatment of an individual may impact its neighbors’ outcomes, an issue referred to as network interference or as nonindividualized treatment response. A key challenge in these situations is that the network is often unknown and difficult or costly to measure. We assume a potential outcomes model with heterogeneous additive network effects, encompassing a broad class of network interference sources, including spillover, peer effects, and contagion. First, we characterize the limitations in estimating the total treatment effect without knowledge of the network that drives interference. By contrast, we subsequently develop a simple estimator and efficient randomized design that outputs an unbiased estimate with low variance in situations where one is given access to average historical baseline measurements prior to the experiment. Our solution does not require knowledge of the underlying network structure, and it comes with statistical guarantees for a broad class of models. Due to their ease of interpretation and implementation, and their theoretical guarantees, we believe our results will have significant impact on the design of randomized experiments.more » « less