Title: Randomized Exploration in Cooperative Multi-Agent Reinforcement Learning
We present the first study on provably efficient randomized exploration in cooperative multi-agent reinforcement learning (MARL). We propose a unified algorithmic framework for randomized exploration in parallel Markov Decision Processes (MDPs), along with two Thompson Sampling (TS)-type algorithms, CoopTS-PHE and CoopTS-LMC, which incorporate the perturbed-history exploration (PHE) strategy and the Langevin Monte Carlo exploration (LMC) strategy, respectively; both are flexible in design and easy to implement in practice. For a special class of parallel MDPs where the transition is (approximately) linear, we prove that both CoopTS-PHE and CoopTS-LMC achieve a $$\widetilde{\mathcal{O}}(d^{3/2}H^2\sqrt{MK})$$ regret bound with communication complexity $$\widetilde{\mathcal{O}}(dHM^2)$$, where $$d$$ is the feature dimension, $$H$$ is the horizon length, $$M$$ is the number of agents, and $$K$$ is the number of episodes. This is the first theoretical result for randomized exploration in cooperative MARL. We evaluate our proposed method on multiple parallel RL environments, including a deep exploration problem (the $$N$$-chain), a video game, and a real-world problem in energy systems. Our experimental results show that our framework achieves better performance, even when the transition model is misspecified. Additionally, we establish a connection between our unified framework and the practical application of federated learning.
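To make the perturbed-history idea behind CoopTS-PHE concrete, the following is a minimal sketch of PHE-style randomized exploration for a linear value estimate: the historical regression targets are perturbed with Gaussian noise before a ridge-regression solve, so each resampling of the noise induces a different greedy policy. The feature map, the noise scale (sigma_noise), and the regularizer (lambda_reg) are illustrative assumptions, not details taken from the paper.

```python
# Minimal PHE-style sketch (illustrative, not the paper's CoopTS-PHE):
# perturb the historical regression targets, then solve ridge regression
# to obtain a randomized linear Q estimate.
import numpy as np

def phe_least_squares(features, targets, sigma_noise=1.0, lambda_reg=1.0, rng=None):
    """features: (n, d) array of phi(s, a) vectors observed so far.
    targets:  (n,) array of regression targets (reward + bootstrapped next value).
    Returns a weight vector w with Q(s, a) approximated by phi(s, a) @ w."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = features.shape
    # Independent Gaussian perturbation of every historical target is the
    # source of exploration: each draw yields a different greedy policy.
    noisy_targets = targets + sigma_noise * rng.standard_normal(n)
    A = features.T @ features + lambda_reg * np.eye(d)
    b = features.T @ noisy_targets
    return np.linalg.solve(A, b)
```

In the cooperative setting, each agent would run such a randomized solve on its local and shared history and synchronize statistics only occasionally, which is what the communication complexity bound above quantifies.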
Award ID(s):
2323112
PAR ID:
10615552
Author(s) / Creator(s):
Publisher / Repository:
Advances in Neural Information Processing Systems
Date Published:
Volume:
37
ISBN:
9798331314385
Page Range / eLocation ID:
74617--74689
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision processes (MDPs), which are more appropriate for applications involving continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, the theoretical understanding of model-free RL algorithms in the average-reward setting is relatively limited. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity than existing results. In the online setting, we design an algorithm, UCB-AVG, based on an optimistic variant of variance-reduced Q-learning. We show that UCB-AVG achieves a regret bound of $$\widetilde{O}(S^5A^2\,sp(h^*)\sqrt{T})$$ after $$T$$ steps, where $$S\times A$$ is the size of the state-action space and $$sp(h^*)$$ is the span of the optimal bias function. Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence on $$T$$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results are either suboptimal in $$T$$ or require strong ergodicity or uniform-mixing assumptions on the MDP. In the simulator setting, we adapt the idea of UCB-AVG to develop a model-free algorithm that finds an $$\epsilon$$-optimal policy with sample complexity $$\widetilde{O}(SA\,sp^2(h^*)\epsilon^{-2} + S^2A\,sp(h^*)\epsilon^{-1})$$. This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $$\Omega(SA\,sp(h^*)\epsilon^{-2})$$. Existing work mainly focuses on ergodic MDPs, and the results typically depend on $$t_{mix}$$, the worst-case mixing time induced by a policy. We remark that the diameter $$D$$ and the mixing time $$t_{mix}$$ are both lower bounded by $$sp(h^*)$$, and that $$t_{mix}$$ can be arbitrarily large for certain MDPs. On the technical side, our approach integrates two key ideas: learning a $$\gamma$$-discounted MDP as an approximation, and leveraging a reference-advantage decomposition for variance reduction in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of the value difference between state pairs to provide a sharper bound on the variance of the reference advantage. We also crucially use a careful choice of the discount factor $$\gamma$$ to balance the approximation error due to discounting against the statistical learning error, and we are able to maintain a good-quality reference value function with $$O(SA)$$ space complexity.
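As a rough illustration of the optimistic Q-learning machinery referenced above (not the paper's UCB-AVG, which additionally uses variance reduction via a reference value function), here is a single tabular update with a count-based exploration bonus on a $$\gamma$$-discounted surrogate; the bonus constant c and the 1/n learning-rate schedule are illustrative assumptions.

```python
# Illustrative optimistic Q-learning step on a gamma-discounted surrogate of
# an average-reward MDP (a sketch, not the paper's UCB-AVG).
import numpy as np

def optimistic_q_update(Q, N, s, a, r, s_next, gamma=0.99, c=1.0):
    """Q: (S, A) table of value estimates; N: (S, A) visit counts."""
    N[s, a] += 1
    alpha = 1.0 / N[s, a]              # simple 1/n learning rate (assumption)
    bonus = c / np.sqrt(N[s, a])       # count-based optimism standing in for a UCB bonus
    target = r + bonus + gamma * Q[s_next].max()
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q, N
```

Choosing $$\gamma$$ close to 1 shrinks the discounting approximation error but inflates the statistical error, which is the trade-off the abstract describes.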
  2. Chaudhuri, Kamalika et al. (Ed.)
    We study the problem of reinforcement learning (RL) with low (policy) switching cost, a problem well-motivated by real-life RL applications in which deployments of new policies are costly and the number of policy updates must be low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of $$\widetilde{O}(\sqrt{H^4S^2AT})$$ while requiring a switching cost of $$O(HSA \log\log T)$$. This is an exponential improvement over the best-known switching cost $$O(H^2SA\log T)$$ among existing methods with $$\widetilde{O}(\mathrm{poly}(H,S,A)\sqrt{T})$$ regret. Here, $$S$$ and $$A$$ denote the number of states and actions in an $$H$$-horizon episodic Markov Decision Process with unknown transitions, and $$T$$ is the number of steps. As a byproduct of our new techniques, we also derive a reward-free exploration algorithm with a switching cost of $$O(HSA)$$. Furthermore, we prove a pair of information-theoretic lower bounds showing that (1) any no-regret algorithm must have a switching cost of $$\Omega(HSA)$$, and (2) any $$\widetilde{O}(\sqrt{T})$$-regret algorithm must incur a switching cost of $$\Omega(HSA\log\log T)$$. Both our algorithms are thus optimal in their switching costs.
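For intuition, one standard way to limit policy switches (not the paper's adaptive policy elimination) is a doubling trigger: re-plan only when some state-action visit count has doubled since the last policy update. The sketch below shows that trigger under hypothetical count bookkeeping; for each (step, state, action) a count can double at most $$O(\log T)$$ times, so this rule caps the number of switches at $$O(HSA\log T)$$, whereas the paper's stage-wise scheme improves this to $$O(HSA\log\log T)$$.

```python
# Doubling trigger: a simple low-switching rule (illustrative; not the
# paper's adaptive policy elimination). The policy is recomputed only when
# some (step, state, action) visit count has doubled since the last switch.
def should_switch(visit_counts, counts_at_last_switch):
    """Both arguments are dicts mapping (h, s, a) -> int visit counts."""
    return any(
        visit_counts[key] >= 2 * max(counts_at_last_switch.get(key, 0), 1)
        for key in visit_counts
    )
```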
  3. In this paper, we present an accelerated quasi-Newton proximal extragradient method for solving unconstrained smooth convex optimization problems. With access only to the gradients of the objective function, we prove that our method achieves a convergence rate of $$\mathcal{O}\bigl(\min\{\frac{1}{k^2}, \frac{\sqrt{d\log k}}{k^{2.5}}\}\bigr)$$, where $$d$$ is the problem dimension and $$k$$ is the number of iterations. In particular, in the regime where $$k = \mathcal{O}(d)$$, our method matches the optimal rate of $$\mathcal{O}(\frac{1}{k^2})$$ achieved by Nesterov's accelerated gradient (NAG). Moreover, in the regime where $$k = \Omega(d \log d)$$, it outperforms NAG and converges at a faster rate of $$\mathcal{O}\bigl(\frac{\sqrt{d\log k}}{k^{2.5}}\bigr)$$. To the best of our knowledge, this result is the first to demonstrate a provable gain for a quasi-Newton-type method over NAG in the convex setting. To achieve such results, we build our method on a recent variant of the Monteiro-Svaiter acceleration framework and adopt an online learning perspective to update the Hessian approximation matrices, in which we relate the convergence rate of our method to the dynamic regret of a specific online convex optimization problem in the space of matrices.
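For reference, the sketch below implements the NAG baseline that the abstract compares against (standard Nesterov acceleration for an $$L$$-smooth convex objective). It is not the paper's quasi-Newton proximal extragradient method; the iteration count and smoothness constant are caller-supplied assumptions.

```python
# Nesterov's accelerated gradient (NAG) for an L-smooth convex objective:
# the O(1/k^2) baseline that the abstract's method matches or improves on.
import numpy as np

def nesterov_agd(grad, x0, L, iters=100):
    """grad: callable returning the gradient at a point; L: smoothness constant."""
    x = np.asarray(x0, dtype=float)
    y = x.copy()
    t = 1.0
    for _ in range(iters):
        x_next = y - grad(y) / L                          # gradient step at the extrapolated point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)  # momentum extrapolation
        x, t = x_next, t_next
    return x
```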
  4. Monotonicity testing of Boolean functions on the hypergrid, $$f:[n]^d \to \{0,1\}$$, is a classic topic in property testing. Determining the non-adaptive complexity of this problem is an important open question. For arbitrary $$n$$, [Black-Chakrabarty-Seshadhri, SODA 2020] describe a tester with query complexity $$\widetilde{O}(\varepsilon^{-4/3}d^{5/6})$$. This complexity is independent of $$n$$, but has a suboptimal dependence on $$d$$. Recently, [Braverman-Khot-Kindler-Minzer, ITCS 2023] and [Black-Chakrabarty-Seshadhri, STOC 2023] describe $$\widetilde{O}(\varepsilon^{-2} n^3\sqrt{d})$$ and $$\widetilde{O}(\varepsilon^{-2} n\sqrt{d})$$-query testers, respectively. These testers have an almost optimal dependence on $$d$$, but a suboptimal polynomial dependence on $$n$$. In this paper, we describe a non-adaptive, one-sided monotonicity tester with query complexity $$O(\varepsilon^{-2} d^{1/2 + o(1)})$$, independent of $$n$$. Up to the $$d^{o(1)}$$-factors, our result resolves the non-adaptive complexity of monotonicity testing for Boolean functions on hypergrids. The independence of $$n$$ yields a non-adaptive, one-sided $$O(\varepsilon^{-2} d^{1/2 + o(1)})$$-query monotonicity tester for Boolean functions $$f:\mathbb{R}^d \to \{0,1\}$$ associated with an arbitrary product measure.
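As a point of reference, the classic pair tester on the hypergrid is sketched below: sample a point, bump one coordinate, and reject if the pair witnesses a monotonicity violation. This is a baseline illustration of a one-sided, non-adaptive tester, not the paper's $$O(\varepsilon^{-2} d^{1/2 + o(1)})$$-query algorithm; the number of queries is a caller-supplied assumption.

```python
# Classic pair tester on the hypergrid [n]^d (a baseline illustration, not
# the paper's tester): reject only on a witnessed monotonicity violation,
# so the test is one-sided, and queries do not depend on earlier answers.
import numpy as np

def pair_tester(f, n, d, num_queries=1000, rng=None):
    """f maps a tuple in {0, ..., n-1}^d to 0 or 1."""
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(num_queries):
        x = rng.integers(0, n, size=d)
        i = int(rng.integers(d))
        if x[i] == n - 1:
            continue                      # no larger neighbor along coordinate i
        y = x.copy()
        y[i] += 1                         # y dominates x in coordinate i only
        if f(tuple(x)) > f(tuple(y)):     # f(x) = 1, f(y) = 0 violates monotonicity
            return "reject"
    return "accept"
```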