NSF PAR Search | NSF Public Access Repository

Contextual Bandits with Stage-wise Constraints

Pacchiano, Aldo; Ghavamzadeh, Mohammad; Bartlett, Peter L (August 2025, Journal of machine learning research)

Free, publicly-accessible full text available August 1, 2026
Multiple-policy Evaluation via Density Estimation

Chen, Yilei; Pacchiano, Aldo; Paschalidis, Ioannis C (May 2025, 42nd International Conference on Machine Learning)

Free, publicly-accessible full text available May 1, 2026
An Instance-Dependent Analysis for the Cooperative Multi-Player Multi-Armed Bandit

Pacchiano, Aldo; Bartlett, Peter L.; Jordan, Michael I. (April 2023, Proceedings of the 34th International Conference on Algorithmic Learning Theory)

Full Text Available
Online Nonsubmodular Minimization with Delayed Costs: From Full Information to Bandit Feedback

Lin, Tianyi; Pacchiano, Aldo; Yu, Yaodong; Jordan, Michael I. (July 2022, International Conference on Machine Learning)

Full Text Available
Stochastic Bandits with Linear Constraints

Pacchiano, Aldo; Ghavamzadeh, Mohammad; Bartlett, Peter L.; Jiang, Heinrich (April 2021, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics)
Banerjee, Arindam; Fukumizu, Kenji (Ed.)
We study a constrained contextual linear bandit setting in which the agent must produce a sequence of policies that maximizes expected cumulative reward over multiple rounds while keeping the expected cost of each policy below a given threshold. We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB), and prove a sublinear bound on its regret that is inversely proportional to the difference between the constraint threshold and the cost of a known feasible action. Our algorithm balances exploration and constraint satisfaction using a novel idea: it scales the radii of the reward and cost confidence sets with different scaling factors. We further specialize our results to multi-armed bandits, propose a computationally efficient algorithm for this setting, and prove a regret bound that is better than the one obtained by simply casting multi-armed bandits as an instance of linear bandits and using the regret bound of OPLB. We also prove a lower bound for the problem studied in the paper and provide simulations to validate our theoretical results. Finally, we show how our algorithm and analysis can be extended to multiple constraints and to the case when the cost of the feasible action is unknown.
Full Text Available
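The optimistic-pessimistic idea in the abstract above (an optimistic reward estimate, a pessimistic cost estimate, and different scaling factors on the two confidence radii) can be illustrated in the multi-armed case. The following is a simplified sketch, not the OPLB algorithm from the paper: it assumes rewards and costs in [0, 1], a Hoeffding-style radius, and a policy restricted to mixing one arm with the known safe arm. The function name, `reward_scale`, `cost_scale`, and `delta` are hypothetical choices made for illustration.

```python
import numpy as np


def optimistic_pessimistic_policy(reward_means, cost_means, counts, threshold,
                                  safe_arm, safe_cost,
                                  reward_scale=2.0, cost_scale=1.0, delta=0.05):
    """Return a distribution over arms (illustrative sketch, not the paper's OPLB).

    Assumes rewards and costs lie in [0, 1] and that safe_cost < threshold.
    """
    reward_means = np.asarray(reward_means, dtype=float)
    cost_means = np.asarray(cost_means, dtype=float)
    counts = np.asarray(counts, dtype=float)

    # Hoeffding-style radius (an assumption); unpulled arms get an infinite radius.
    with np.errstate(divide="ignore"):
        radius = np.sqrt(2.0 * np.log(1.0 / delta) / counts)

    # Optimistic rewards and pessimistic costs, with different radius scalings.
    reward_ucb = np.minimum(1.0, reward_means + reward_scale * radius)
    cost_ucb = np.minimum(1.0, cost_means + cost_scale * radius)
    cost_ucb[safe_arm] = safe_cost  # the safe arm's cost is assumed known

    best_value, best_policy = -np.inf, None
    for arm in range(len(counts)):
        if cost_ucb[arm] <= threshold:
            weight = 1.0  # the arm is pessimistically feasible on its own
        else:
            # Largest weight whose mixture with the safe arm meets the threshold.
            weight = (threshold - safe_cost) / (cost_ucb[arm] - safe_cost)
        value = weight * reward_ucb[arm] + (1.0 - weight) * reward_ucb[safe_arm]
        if value > best_value:
            policy = np.zeros(len(counts))
            policy[safe_arm] += 1.0 - weight
            policy[arm] += weight
            best_value, best_policy = value, policy
    return best_policy
```

In this sketch the weight that can be placed on a risky arm is proportional to the gap `threshold - safe_cost`, which gives some intuition for why a small gap makes exploration slow. Calling `optimistic_pessimistic_policy([0.2, 0.9], [0.1, 0.8], [1000, 1000], threshold=0.5, safe_arm=0, safe_cost=0.1)` returns a mixture that splits probability between the safe arm and the higher-reward arm.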
Near Optimal Policy Optimization via REPS

Pacchiano, Aldo; Lee, Jonathan; Bartlett, Peter L.; Nachum, Ofir (January 2021, Advances in neural information processing systems)

Full Text Available
On the Theory of Reinforcement Learning with Once-per-Episode Feedback

Chatterji, Niladri; Pacchiano, Aldo; Bartlett, Peter L.; Jordan, Michael I. (January 2021, Advances in neural information processing systems)

Full Text Available
On Approximate Thompson Sampling with Langevin Algorithms

Mazumdar, Eric; Pacchiano, Aldo; Ma, Yian; Jordan, Michael; Bartlett, Peter (January 2020, Proceedings of the 37th International Conference on Machine Learning)
Daumé III, Hal; Singh, Aarti (Ed.)
Thompson sampling for multi-armed bandit problems is known to enjoy favorable performance in both theory and practice. However, its wider deployment is restricted by a significant computational limitation: the need for samples from posterior distributions at every iteration. In practice, this limitation is alleviated by approximate sampling methods, yet provably incorporating approximate samples into Thompson sampling algorithms remains an open problem. In this work, we address this problem by proposing two efficient Langevin MCMC algorithms tailored to Thompson sampling. The resulting approximate Thompson sampling algorithms are efficiently implementable and provably achieve optimal instance-dependent regret for the multi-armed bandit (MAB) problem. To prove these results, we derive novel posterior concentration bounds and MCMC convergence rates for log-concave distributions, which may be of independent interest.
Full Text Available
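The general idea of replacing exact posterior samples with Langevin samples inside Thompson sampling can be sketched for a toy Gaussian bandit. This is a minimal illustration, not either of the two algorithms proposed in the paper: it assumes a Gaussian prior and unit-variance Gaussian rewards, uses the plain unadjusted Langevin update started from the prior mean, and picks a step size by hand. All function names and default values are hypothetical.

```python
import numpy as np


def langevin_posterior_sample(rewards, prior_mean=0.0, prior_var=1.0,
                              noise_var=1.0, n_steps=100, rng=None):
    """Approximate draw from the posterior over one arm's mean reward."""
    rng = np.random.default_rng() if rng is None else rng
    rewards = np.asarray(rewards, dtype=float)
    n = rewards.size
    # Curvature of the negative log posterior (Gaussian prior and likelihood).
    precision = 1.0 / prior_var + n / noise_var
    step = 0.1 / precision  # hand-picked step size, an assumption of this sketch
    theta = prior_mean
    for _ in range(n_steps):
        # Gradient of the negative log posterior at theta.
        grad = (theta - prior_mean) / prior_var + (n * theta - rewards.sum()) / noise_var
        # Unadjusted Langevin update: gradient step plus Gaussian noise.
        theta = theta - step * grad + np.sqrt(2.0 * step) * rng.standard_normal()
    return theta


def run_thompson_sampling(true_means, horizon=200, seed=0):
    """Run Thompson sampling with Langevin samples; return pull counts per arm."""
    rng = np.random.default_rng(seed)
    history = [[] for _ in true_means]
    pulls = np.zeros(len(true_means), dtype=int)
    for _ in range(horizon):
        # One approximate posterior sample per arm, then act greedily on them.
        samples = [langevin_posterior_sample(h, rng=rng) for h in history]
        arm = int(np.argmax(samples))
        reward = true_means[arm] + rng.standard_normal()
        history[arm].append(reward)
        pulls[arm] += 1
    return pulls
```

For example, `run_thompson_sampling([0.0, 2.0])` should concentrate its pulls on the second arm. Because the Langevin chain is not Metropolis-corrected, its samples are slightly overdispersed relative to the exact posterior, which is the kind of approximation error the abstract refers to.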
