skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees
Researchers have introduced a new algorithm to estimate structural models of dynamic decisions by human agents, addressing the challenge of high computational complexity. Traditionally, this task involves a nested structure: an inner problem identifying an optimal policy and an outer problem maximizing a measure of fit. Previous methods have struggled with large discrete state spaces or high-dimensional continuous state spaces, often sacrificing reward estimation accuracy. The new approach combines policy improvement with a stochastic gradient step for likelihood maximization, ensuring accurate reward estimation without compromising computational efficiency. This single-loop algorithm, designed to handle high-dimensional state spaces, converges to a stationary solution with finite-time guarantees. When the reward is linearly parameterized, it approximates the maximum likelihood estimator sublinearly, offering a robust solution for complex decision modeling tasks.  more » « less
Award ID(s):
1910385 2414372 2426064
PAR ID:
10546631
Author(s) / Creator(s):
; ;
Publisher / Repository:
INFORMS
Date Published:
Journal Name:
Operations Research
ISSN:
0030-364X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Social networks are frequently polluted by rumors, which can be detected by advanced models such as graph neural networks. However, the models are vulnerable to attacks, and discovering and understanding the vulnerabilities is critical to robust rumor detection. To discover subtle vulnerabilities, we design a attacking algorithm based on reinforcement learning to camouflage rumors against black-box detectors. We address exponentially large state spaces, high-order graph dependencies, and ranking dependencies, which are unique to the problem setting but fundamentally challenging for the state-of-the-art end-to-end approaches. We design domain-specific features that have causal effect on the reward, so that even a linear policy can arrive at powerful attacks with additional interpretability. To speed up policy optimization, we devise: (i) a credit assignment method that proportionally decomposes delayed and aggregated rewards to atomic attacking actions for enhance feature-reward associations; (ii) a time-dependent control variate to reduce prediction variance due to large state-action spaces and long attack horizon, based on reward variance analysis and a Bayesian analysis of the prediction distribution. On two real world datasets of rumor detection tasks, we demonstrate: (i) the effectiveness of the learned attacking policy on a wide spectrum of target models compared to both rule-based and end-to-end attacking approaches; (ii) the usefulness of the proposed credit assignment strategy and variance reduction components; (iii) the interpretability of the attacking policy. 
    more » « less
  2. In practice, it is essential to compare and rank candidate policies offline before real-world deployment for safety and reliability. Prior work seeks to solve this offline policy ranking (OPR) problem through value-based methods, such as Off-policy evaluation (OPE). However, they fail to analyze special case performance (e.g., worst or best cases), due to the lack of holistic characterization of policies’ performance. It is even more difficult to estimate precise policy values when the reward is not fully accessible under sparse settings. In this paper, we present Probabilistic Offline Policy Ranking (POPR), a framework to address OPR problems by leveraging expert data to characterize the probability of a candidate policy behaving like experts, and approximating its entire performance posterior distribution to help with ranking. POPR does not rely on value estimation, and the derived performance posterior can be used to distinguish candidates in worst-, best-, and average-cases. To estimate the posterior, we propose POPR-EABC, an Energy-based Approximate Bayesian Computation (ABC) method conducting likelihood-free inference. POPR-EABC reduces the heuristic nature of ABC by a smooth energy function, and improves the sampling efficiency by a pseudo-likelihood. We empirically demonstrate that POPR-EABC is adequate for evaluating policies in both discrete and continuous action spaces across various experiment environments, and facilitates probabilistic comparisons of candidate policies before deployment. 
    more » « less
  3. Suppose an online platform wants to compare a treatment and control policy (e.g., two different matching algorithms in a ridesharing system, or two different inventory management algorithms in an online retail site). Standard experimental approaches to this problem are biased (due to temporal interference between the policies), and not sample efficient. We study optimal experimental design for this setting. We view testing the two policies as the problem of estimating the steady state difference in reward between two unknown Markov chains (i.e., policies). We assume estimation of the steady state reward for each chain proceeds via nonparametric maximum likelihood, and search for consistent (i.e., asymptotically unbiased) experimental designs that are efficient (i.e., asymptotically minimum variance). Characterizing such designs is equivalent to a Markov decision problem with a minimum variance objective; such problems generally do not admit tractable solutions. Remarkably, in our setting, using a novel application of classical martingale analysis of Markov chains via Poisson's equation, we characterize efficient designs via a succinct convex optimization problem. We use this characterization to propose a consistent, efficient online experimental design that adaptively samples the two Markov chains. 
    more » « less
  4. We study offline reinforcement learning (RL) with heavy-tailed reward distribution and data corruption: (i) Moving beyond subGaussian reward distribution, we allow the rewards to have infinite variances; (ii) We allow corruptions where an attacker can arbitrarily modify a small fraction of the rewards and transitions in the dataset. We first derive a sufficient optimality condition for generalized Pessimistic Value Iteration (PEVI), which allows various estimators with proper confidence bounds and can be applied to multiple learning settings. In order to handle the data corruption and heavy-tailed reward setting, we prove that the trimmed-mean estimation achieves the minimax optimal error rate for robust mean estimation under heavy-tailed distributions. In the PEVI algorithm, we plug in the trimmed mean estimation and the confidence bound to solve the robust offline RL problem. Standard analysis reveals that data corruption induces a bias term in the suboptimality gap, which gives the false impression that any data corruption prevents optimal policy learning. By using the optimality condition for the generalized PEVI, we show that as long as the bias term is less than the ``action gap'', the policy returned by PEVI achieves the optimal value given sufficient data. 
    more » « less
  5. Markov games model interactions among multiple players in a stochastic, dynamic environment. Each player in a Markov game maximizes its expected total discounted reward, which depends upon the policies of the other players. We formulate a class of Markov games, termed affine Markov games, where an affine reward function couples the players’ actions. We introduce a novel solution concept, the soft-Bellman equilibrium, where each player is boundedly rational and chooses a soft-Bellman policy rather than a purely rational policy as in the well-known Nash equilibrium concept. We provide conditions for the existence and uniqueness of the soft-Bellman equilibrium and propose a nonlinear least-squares algorithm to compute such an equilibrium in the forward problem. We then solve the inverse game problem of inferring the players’ reward parameters from observed state-action trajectories via a projected-gradient algorithm. Experiments in a predator-prey OpenAI Gym environment show that the reward parameters inferred by the proposed algorithm outper- form those inferred by a baseline algorithm: they reduce the Kullback-Leibler divergence between the equilibrium policies and observed policies by at least two orders of magnitude. 
    more » « less