NSF PAR Search | NSF Public Access Repository

Cost-Aware Near-Optimal Policy Learning

He_Yueya, J; Lee, J; Jorke, M; Brunskill, E (February 2025, Association for the Advancement of Artificial Intelligence)

Free, publicly-accessible full text available February 19, 2026
OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators

Nie, A; Chandak, Y; Yuan, C; Badrinath, A; Flet-Berliac, Y; Brunskill, E (December 2024, Neural Information Processing Systems Proceedings)

Free, publicly-accessible full text available December 12, 2025
Experiment Planning with Function Approximation

Pacchiano, A; Lee, J; Brunskill, E (December 2023, Neural Information Processing Systems)

We study the problem of experiment planning with function approximation in contextual bandit problems. In settings where deploying adaptive algorithms carries significant overhead (for example, when the execution of the data collection policies must be distributed, or a human in the loop is needed to implement these policies), producing a set of data collection policies in advance is paramount. We study the setting where a large dataset of contexts but not rewards is available and may be used by the learner to design an effective data collection strategy. Although this problem has been well studied when rewards are linear [53], results are still missing for more complex reward models. In this work we propose two experiment planning strategies compatible with function approximation. The first is an eluder planning and sampling procedure that can recover optimality guarantees depending on the eluder dimension [42] of the reward function class. For the second, we show that a uniform sampler achieves competitive optimality rates in the setting where the number of actions is small. We conclude by introducing a statistical gap that highlights the fundamental differences between planning and adaptive learning, and we provide results for planning with model selection.
Full Text Available
Estimating Optimal Policy Value in Linear Contextual Bandits Beyond Gaussianity

Lee, J; Pacchiano, A; Kong, W; Muthukumar, V; Brunskill, E (February 2024, Transactions on machine learning research)

Full Text Available
Proportional Response: Contextual Bandits for Simple and Cumulative Regret Minimization

Krishnamurthy, S; Zhan, R; Athey, S; Brunskill, E (December 2023, Neural Information Processing Systems Proceedings)

Full Text Available
Waypoint Transformer: Reinforcement Learning via Supervised Learning with Intermediate Targets

Badrinath, A; Flet-Berliac, Y; Nie, A; Brunskill, E (December 2023, Neural Information Processing Systems Proceedings)

Full Text Available
Reinforcement learning tutor better supported lower performers in a math task.

Ruan, S; Nie, A; Steenbergen, W; He, J; Zhang, JQ; Guo, M; Liu, Y; Dang_Nguyen, K; Wang, CY; Ying, R; et al (February 2024, Machine learning)

Resource limitations make it challenging to provide all students with one of the most effective educational interventions: personalized instruction. Reinforcement learning could be a pivotal tool to decrease the development costs and enhance the effectiveness of intelligent tutoring software that aims to provide the right support, at the right time, to a student. Here we illustrate that deep reinforcement learning can be used to provide adaptive pedagogical support to students learning about the concept of volume in a narrative storyline software. Using explainable artificial intelligence tools, we extracted interpretable insights about the pedagogical policy learned and demonstrated that the resulting policy had similar performance in a different student population. Most importantly, in both studies, the reinforcement-learning narrative system had the largest benefit for those students with the lowest initial pretest scores, suggesting the opportunity for AI to adapt and provide support for those most in need.
Full Text Available
Offline Policy Optimization with Eligible Actions

Liu, Y; Flet-Berliac, Y; Brunskill, E. (January 2022, Uncertainty in artificial intelligence)

Offline policy optimization could have a large impact on many real-world decision-making problems, as online learning may be infeasible in many applications. Importance sampling and its variants are a commonly used type of estimator in offline policy evaluation, and such estimators typically do not require assumptions on the properties and representational capabilities of value function or decision process model function classes. In this paper, we identify an important overfitting phenomenon in optimizing the importance weighted return, in which it may be possible for the learned policy to essentially avoid making aligned decisions for part of the initial state space. We propose an algorithm to avoid this overfitting through a new per-state-neighborhood normalization constraint, and provide a theoretical justification of the proposed algorithm. We also show the limitations of previous attempts at this approach. We test our algorithm on a healthcare-inspired simulator, a logged dataset collected from real hospitals, and continuous control tasks. These experiments show the proposed method yields less overfitting and better test performance compared to state-of-the-art batch reinforcement learning algorithms.
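
For readers unfamiliar with the importance sampling estimators mentioned in this abstract, the following is a minimal sketch of the ordinary and weighted (self-normalized) variants in a one-step setting. It is not the paper's algorithm and does not implement the per-state-neighborhood constraint; the function name, the tuple layout of `logged`, and the toy data are all assumptions made for illustration.

```python
def importance_sampling_estimates(logged, target_prob):
    """Estimate a target policy's value from logged one-step data.

    logged: list of (state, action, reward, behavior_prob) tuples, where
        behavior_prob is the probability the logging policy gave the action.
    target_prob: function (state, action) -> probability under the target policy.
    Returns (ordinary_is, weighted_is).
    """
    # Importance weight = target probability / behavior probability.
    weights = [target_prob(s, a) / b for (s, a, _, b) in logged]
    rewards = [r for (_, _, r, _) in logged]
    weighted_sum = sum(w * r for w, r in zip(weights, rewards))

    # Ordinary IS divides by the sample count (unbiased, higher variance).
    ordinary_is = weighted_sum / len(logged)

    # Weighted IS divides by the sum of weights (biased, lower variance).
    total_weight = sum(weights)
    weighted_is = weighted_sum / total_weight if total_weight > 0 else 0.0
    return ordinary_is, weighted_is


# Hypothetical toy data: one state, two actions, uniform logging policy.
# Action 1 always yields reward 1 and action 0 yields reward 0.
logged = [
    ("s", 0, 0.0, 0.5),
    ("s", 1, 1.0, 0.5),
    ("s", 1, 1.0, 0.5),
    ("s", 1, 1.0, 0.5),
]

def always_action_one(state, action):
    # Target policy that deterministically picks action 1.
    return 1.0 if action == 1 else 0.0

ordinary, weighted = importance_sampling_estimates(logged, always_action_one)
print(ordinary, weighted)  # 1.5 1.0
```

In this toy case the target policy's true value is 1. The ordinary estimate overshoots on this particular sample because action 1 happens to be over-represented in the log, while the weighted estimate normalizes that away.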
Full Text Available
Oracle Inequalities for Model Selection in Offline Reinforcement Learning

Lee, J.; Tucker, G.; Nachum, O.; Dai, B.; Brunskill, E. (January 2022, Advances in neural information processing systems)

In offline reinforcement learning (RL), a learner leverages prior logged data to learn a good policy without interacting with the environment. A major challenge in applying such methods in practice is the lack of both theoretically principled and practical tools for model selection and evaluation. To address this, we study the problem of model selection in offline RL with value function approximation. The learner is given a nested sequence of model classes to minimize squared Bellman error and must select among these to achieve a balance between approximation and estimation error of the classes. We propose the first model selection algorithm for offline RL that achieves minimax rate-optimal oracle inequalities up to logarithmic factors. The algorithm, MODBE, takes as input a collection of candidate model classes and a generic base offline RL algorithm. By successively eliminating model classes using a novel one-sided generalization test, MODBE returns a policy with regret scaling with the complexity of the minimally complete model class. In addition to its theoretical guarantees, it is conceptually simple and computationally efficient, amounting to solving a series of square loss regression problems and then comparing relative square loss between classes. We conclude with several numerical simulations showing it is capable of reliably selecting a good model class.
Full Text Available
Universal Off-Policy Evaluation

Chandak, Y; Niekum, S; Castro da Silva, B; Learned-Miller, E; Brunskill, E; Thomas, P (December 2021, Neural Information Processing Systems)
Full Text Available
