Search for: All records

Award ID contains: 1901252

« Prev Next »

Total Resources

17

Resource Type
Conference Paper

9

Conference Proceeding

1

Dataset

0

Journal Article

7

Workshop Report

0

Availability
Full Text / Resource Available

16

Citation Only

1

Save Results
Excel (limit 2000)
CSV (limit 5000)
XML (limit 5000)

Have feedback or suggestions for a way to improve these results?
!

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Online Learning in Stackelberg Games with an Omniscient Follower

Zhao, Geng ; Zhu, Banghua ; Jiao, Jiantao ; Jordan, Michael ( August 2023 , Proceedings of Machine Learning Research)

We study the problem of online learning in a two-player decentralized cooperative Stackelberg game. In each round, the leader first takes an action, followed by the follower who takes their action after observing the leader’s move. The goal of the leader is to learn to minimize the cumulative regret based on the history of interactions. Differing from the traditional formulation of repeated Stackelberg games, we assume the follower is omniscient, with full knowledge of the true reward, and that they always best-respond to the leader’s actions. We analyze the sample complexity of regret minimization in this repeated Stackelberg game. We show that depending on the reward structure, the existence of the omniscient follower may change the sample complexity drastically, from constant to exponential, even for linear cooperative Stackelberg games.
more » « less
Free, publicly-accessible full text available August 1, 2024
Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons

Zhu, Banghua and ( January 2023 , Proceedings of Machine Learning Research)
Krause, Andreas and (Ed.)
We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry-Luce (BTL) model (pairwise comparison) and Plackett-Luce (PL) model ($K$-wise comparison), MLE converges under certain semi-norm for the family of linear reward. On the other hand, when training a policy based on the learned reward model, we show that MLE fails while a pessimistic MLE provides policies with good performance under certain coverage assumption. We also show that under the PL model, both the true MLE and a different MLE which splits the $K$-wise comparison into pairwise comparisons converge, while the true MLE is asymptotically more efficient. Our results validate the empirical success of the existing RLHF algorithms, and provide new insights for algorithm design. Our analysis can also be applied for the problem of online RLHF and inverse reinforcement learning.
more » « less
Full Text Available
Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism

https://doi.org/10.1109/TIT.2022.3185139

Rashidinejad, Paria ; Zhu, Banghua ; Ma, Cong ; Jiao, Jiantao ; Russell, Stuart ( December 2022 , IEEE Transactions on Information Theory)

Full Text Available
Minimax Off-Policy Evaluation for Multi-Armed Bandits

https://doi.org/10.1109/TIT.2022.3162335

Ma, Cong ; Zhu, Banghua ; Jiao, Jiantao ; Wainwright, Martin J. ( August 2022 , IEEE Transactions on Information Theory)

Full Text Available
ROOT-SGD: Sharp Nonasymptotics and Asymptotic Efficiency in a Single Algorithm

Li, C.J. ; Mou, W. ; Wainwright, M. J. ; Jordan, M. I. ( July 2022 , Conference on Computational Learning Theory)

We study the problem of solving strongly convex and smooth unconstrained optimization problems using stochastic first-order algorithms. We devise a novel algorithm, referred to as \emph{Recursive One-Over-T SGD} (\ROOTSGD), based on an easily implementable, recursive averaging of past stochastic gradients. We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense. On the nonasymptotic side, we prove risk bounds on the last iterate of \ROOTSGD with leading-order terms that match the optimal statistical risk with a unity pre-factor, along with a higher-order term that scales at the sharp rate of O(n−3/2) under the Lipschitz condition on the Hessian matrix. On the asymptotic side, we show that when a mild, one-point Hessian continuity condition is imposed, the rescaled last iterate of (multi-epoch) \ROOTSGD converges asymptotically to a Gaussian limit with the Cram\'{e}r-Rao optimal asymptotic covariance, for a broad range of step-size choices.
more » « less
Full Text Available
Online Nonsubmodular Minimization with Delayed Costs: From Full Information to Bandit Feedback

Lin, Tianyi ; Pacchiano, Aldo ; Yu, Yaodong ; Jordan, Michael I. ( July 2022 , International Conference on Machine Learning)

Full Text Available
Private Prediction Sets

https://doi.org/10.1162/99608f92.16c71dad

Angelopoulos, Anastasios Nikolas ; Bates, Stephen ; Zrnic, Tijana ; Jordan, Michael I. ( April 2022 , Harvard Data Science Review)

Full Text Available
Computational Benefits of Intermediate Rewards for Goal-Reaching Policy Learning

https://doi.org/10.1613/jair.1.13326

Zhai, Yuexiang ; Baek, Christina ; Zhou, Zhengyuan ; Jiao, Jiantao ; Ma, Yi ( January 2022 , Journal of Artificial Intelligence Research)

Many goal-reaching reinforcement learning (RL) tasks have empirically verified that rewarding the agent on subgoals improves convergence speed and practical performance. We attempt to provide a theoretical framework to quantify the computational benefits of rewarding the completion of subgoals, in terms of the number of synchronous value iterations. In particular, we consider subgoals as one-way intermediate states, which can only be visited once per episode and propose two settings that consider these one-way intermediate states: the one-way single-path (OWSP) and the one-way multi-path (OWMP) settings. In both OWSP and OWMP settings, we demonstrate that adding intermediate rewards to subgoals is more computationally efficient than only rewarding the agent once it completes the goal of reaching a terminal state. We also reveal a trade-off between computational complexity and the pursuit of the shortest path in the OWMP setting: adding intermediate rewards significantly reduces the computational complexity of reaching the goal but the agent may not find the shortest path, whereas with sparse terminal rewards, the agent finds the shortest path at a significantly higher computational cost. We also corroborate our theoretical results with extensive experiments on the MiniGrid environments using Q-learning and some popular deep RL algorithms.
more » « less
Full Text Available
Robust Estimation for Nonparametric Families via Generative Adversarial Networks

Zhu, Banghua ; Jiao, Jiantao ; Jordan, Michael ( January 2022 , IEEE International Symposium on Information Theory 2022)

Full Text Available
MADE: Exploration via Maximizing Deviation from Explored Regions

Zhang, Tianjun ; Rashidinejad, Paria ; Jiao, Jiantao ; Tian, Yuandong ; Gonzalez, Joseph E ; Russell, Stuart ( January 2021 , Advances in Neural Information Processing Systems 34 (NeurIPS 2021))

Full Text Available

« Prev Next »