skip to main content


Title: Offline Policy Optimization with Eligible Actions
Offline policy optimization could have a large impact on many real-world decision-making problems, as online learning may be infeasible in many applications. Importance sampling and its variants are a commonly used type of estimator in offline policy evaluation, and such estimators typically do not require assumptions on the properties and representational capabilities of value function or decision process model function classes. In this paper, we identify an important overfitting phenomenon in optimizing the importance weighted return, in which it may be possible for the learned policy to essentially avoid making aligned decisions for part of the initial state space. We propose an algorithm to avoid this overfitting through a new per-state-neighborhood normalization constraint, and provide a theoretical justification of the proposed algorithm. We also show the limitations of previous attempts to this approach. We test our algorithm in a healthcare-inspired simulator, a logged dataset collected from real hospitals and continuous control tasks. These experiments show the proposed method yields less overfitting and better test performance compared to state-of-the-art batch reinforcement learning algorithms.  more » « less
Award ID(s):
2112926
NSF-PAR ID:
10382137
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Uncertainty in artificial intelligence
ISSN:
1525-3384
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Many transit agencies operating paratransit and microtransit ser-vices have to respond to trip requests that arrive in real-time, which entails solving hard combinatorial and sequential decision-making problems under uncertainty. To avoid decisions that lead to signifi-cant inefficiency in the long term, vehicles should be allocated to requests by optimizing a non-myopic utility function or by batching requests together and optimizing a myopic utility function. While the former approach is typically offline, the latter can be performed online. We point out two major issues with such approaches when applied to paratransit services in practice. First, it is difficult to batch paratransit requests together as they are temporally sparse. Second, the environment in which transit agencies operate changes dynamically (e.g., traffic conditions can change over time), causing the estimates that are learned offline to become stale. To address these challenges, we propose a fully online approach to solve the dynamic vehicle routing problem (DVRP) with time windows and stochastic trip requests that is robust to changing environmental dynamics by construction. We focus on scenarios where requests are relatively sparse-our problem is motivated by applications to paratransit services. We formulate DVRP as a Markov decision process and use Monte Carlo tree search to evaluate actions for any given state. Accounting for stochastic requests while optimizing a non-myopic utility function is computationally challenging; indeed, the action space for such a problem is intractably large in practice. To tackle the large action space, we leverage the structure of the problem to design heuristics that can sample promising actions for the tree search. Our experiments using real-world data from our partner agency show that the proposed approach outperforms existing state-of-the-art approaches both in terms of performance and robustness. 
    more » « less
  2. Modern machine learning has achieved impressive prediction performance, but often sacrifices interpretability, a critical consideration in high-stakes domains such as medicine. In such settings, practitioners often use highly interpretable decision tree models, but these suffer from inductive bias against additive structure. To overcome this bias, we propose Fast Interpretable Greedy-Tree Sums (FIGS), which generalizes the CART algorithm to simultaneously grow a flexible number of trees in summation. By combining logical rules with addition, FIGS is able to adapt to additive structure while remaining highly interpretable. Extensive experiments on real-world datasets show that FIGS achieves state-of-the-art prediction performance. To demonstrate the usefulness of FIGS in high-stakes domains, we adapt FIGS to learn clinical decision instruments (CDIs), which are tools for guiding clinical decision-making. Specifically, we introduce a variant of FIGS known as G-FIGS that accounts for the heterogeneity in medical data. G-FIGS derives CDIs that reflect domain knowledge and enjoy improved specificity (by up to 20% over CART) without sacrificing sensitivity or interpretability. To provide further insight into FIGS, we prove that FIGS learns components of additive models, a property we refer to as disentanglement. Further, we show (under oracle conditions) that unconstrained tree-sum models leverage disentanglement to generalize more efficiently than single decision tree models when fitted to additive regression functions. Finally, to avoid overfitting with an unconstrained number of splits, we develop Bagging-FIGS, an ensemble version of FIGS that borrows the variance reduction techniques of random forests. Bagging-FIGS enjoys competitive performance with random forests and XGBoost on real-world datasets. 
    more » « less
  3. Identifying cause-effect relations among variables is a key step in the decision-making process. Whereas causal inference requires randomized experiments, researchers and policy makers are increasingly using observational studies to test causal hypotheses due to the wide availability of data and the infeasibility of experiments. The matching method is the most used technique to make causal inference from observational data. However, the pair assignment process in one-to-one matching creates uncertainty in the inference because of different choices made by the experimenter. Recently, discrete optimization models have been proposed to tackle such uncertainty; however, they produce 0-1 nonlinear problems and lack scalability. In this work, we investigate this emerging data science problem and develop a unique computational framework to solve the robust causal inference test instances from observational data with continuous outcomes. In the proposed framework, we first reformulate the nonlinear binary optimization problems as feasibility problems. By leveraging the structure of the feasibility formulation, we develop greedy schemes that are efficient in solving robust test problems. In many cases, the proposed algorithms achieve a globally optimal solution. We perform experiments on real-world data sets to demonstrate the effectiveness of the proposed algorithms and compare our results with the state-of-the-art solver. Our experiments show that the proposed algorithms significantly outperform the exact method in terms of computation time while achieving the same conclusion for causal tests. Both numerical experiments and complexity analysis demonstrate that the proposed algorithms ensure the scalability required for harnessing the power of big data in the decision-making process. Finally, the proposed framework not only facilitates robust decision making through big-data causal inference, but it can also be utilized in developing efficient algorithms for other nonlinear optimization problems such as quadratic assignment problems. History: Accepted by Ram Ramesh, Area Editor for Data Science and Machine Learning. Funding: This work was supported by the Division of Civil, Mechanical and Manufacturing Innovation of the National Science Foundation [Grant 2047094]. Supplemental Material: The online supplements are available at https://doi.org/10.1287/ijoc.2022.1226 . 
    more » « less
  4. Offline reinforcement learning, which aims at optimizing sequential decision-making strategies with historical data, has been extensively applied in real-life applications. State-Of-The-Art algorithms usually leverage powerful function approximators (e.g. neural networks) to alleviate the sample complexity hurdle for better empirical performances. Despite the successes, a more systematic under- standing of the statistical complexity for function approximation remains lacking. Towards bridging the gap, we take a step by considering offline reinforcement learning with differentiable function class approximation (DFA). This function class naturally incorporates a wide range of models with nonlinear/nonconvex structures. We show offline RL with differentiable function approximation is provably efficient by analyzing the pessimistic fitted Q-learning (PFQL) algorithm, and our results provide the theoretical basis for understanding a variety of practical heuristics that rely on Fitted Q-Iteration style design. In addition, we further im- prove our guarantee with a tighter instance-dependent characterization. We hope our work could draw interest in studying reinforcement learning with differentiable function approximation beyond the scope of current research. 
    more » « less
  5. Decision-making under uncertainty (DMU) is present in many important problems. An open challenge is DMU in non-stationary environments, where the dynamics of the environment can change over time. Reinforcement Learning (RL), a popular approach for DMU problems, learns a policy by interacting with a model of the environment offline. Unfortunately, if the environment changes the policy can become stale and take sub-optimal actions, and relearning the policy for the updated environment takes time and computational effort. An alternative is online planning approaches such as Monte Carlo Tree Search (MCTS), which perform their computation at decision time. Given the current environment, MCTS plans using high-fidelity models to determine promising action trajectories. These models can be updated as soon as environmental changes are detected to immediately incorporate them into decision making. However, MCTS’s convergence can be slow for domains with large state-action spaces. In this paper, we present a novel hybrid decision-making approach that combines the strengths of RL and planning while mitigating their weaknesses. Our approach, called Policy Augmented MCTS (PA-MCTS), integrates a policy’s actin-value estimates into MCTS, using the estimates to seed the action trajectories favored by the search. We hypothesize that PA-MCTS will converge more quickly than standard MCTS while making better decisions than the policy can make on its own when faced with nonstationary environments. We test our hypothesis by comparing PA-MCTS with pure MCTS and an RL agent applied to the classical CartPole environment. We find that PC-MCTS can achieve higher cumulative rewards than the policy in isolation under several environmental shifts while converging in significantly fewer iterations than pure MCTS. 
    more » « less