Offline reinforcement learning (offline RL) considers problems where learning is performed using only previously collected samples and is useful in settings where collecting new data is costly or risky. In model-based offline RL, the learner performs estimation (or optimization) using a model constructed from the empirical transition frequencies. We analyze the sample complexity of vanilla model-based offline RL with dependent samples in the infinite-horizon discounted-reward setting. In our setting, the samples obey the dynamics of the Markov decision process and may therefore be interdependent. Without assuming independent samples, we provide a high-probability, polynomial sample-complexity bound for vanilla model-based off-policy evaluation under partial or uniform coverage. We extend this result to off-policy optimization under uniform coverage. As a point of comparison with the model-based approach, we analyze the sample complexity of off-policy evaluation with vanilla importance sampling in the infinite-horizon setting. Finally, we provide an estimator that outperforms the sample-mean estimator for the nearly deterministic dynamics that are prevalent in reinforcement learning.
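The vanilla model-based pipeline described above can be sketched in the tabular setting: build the empirical transition model from logged (s, a, s') samples, then evaluate a target policy on that estimated model. This is a minimal illustration only; the function names and the uniform fallback for unvisited state–action pairs are our own assumptions, not details from the paper.

```python
import numpy as np

def empirical_model(transitions, S, A):
    """Vanilla empirical transition model P_hat[s, a, s'] built from
    logged (s, a, s') tuples; samples may be dependent (one trajectory)."""
    counts = np.zeros((S, A, S))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    n_sa = counts.sum(axis=2, keepdims=True)
    # Unvisited (s, a) pairs fall back to a uniform distribution --
    # an illustrative modeling choice, not taken from the paper.
    return np.where(n_sa > 0, counts / np.maximum(n_sa, 1), 1.0 / S)

def evaluate_policy(P_hat, R, pi, gamma=0.9):
    """Off-policy evaluation on the estimated model: solve the linear
    system v = R_pi + gamma * P_pi v for the target policy pi[s, a]."""
    S = R.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P_hat)  # policy-induced chain
    R_pi = np.einsum('sa,sa->s', pi, R)        # policy-averaged reward
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```

With the exact empirical frequencies in hand, the evaluation step is a single linear solve; the sample-complexity question above is how close this plug-in value is to the true value of the policy.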
This content will become publicly available on February 21, 2026
Off-line Estimation of Controlled Markov Chains: Minimaxity and Sample Complexity
New Insights into Off-line Estimation for Controlled Markov Chains Unveiled

A team of researchers from Purdue University and Northwestern University has unveiled new findings in off-line estimation for controlled Markov chains, addressing challenges in analyzing complex data generated under arbitrary dynamics. The study introduces a nonparametric estimator for transition probabilities and shows that it remains robust even in nonstationary, non-Markovian environments. The team developed precise sample-complexity bounds, revealing a delicate interplay between the mixing properties of the logging policy and the size of the data set. Their analysis highlights how achieving optimal statistical risk depends on this trade-off, broadening the scope of off-line estimation under diverse conditions. Examples include ergodic and weakly ergodic chains as well as controlled chains with episodic or greedy controls. Significantly, this research confirms that the widely used estimator, which computes state–action transition ratios, is minimax optimal, ensuring its reliability in general scenarios. This advance paves the way for improved evaluation of stationary Markov control policies, marking a breakthrough in understanding complex off-line systems.
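The "widely used estimator" the summary refers to is the count-ratio estimator P̂(s' | s, a) = N(s, a, s') / N(s, a). A minimal sketch over a single logged trajectory, whose consecutive samples are dependent exactly as in the setting above (the function name and trajectory encoding are illustrative assumptions):

```python
from collections import defaultdict

def transition_ratio_estimator(trajectory):
    """State-action transition-ratio estimator from one logged trajectory,
    given as a list of (state, action) pairs: for each observed triple,
    P_hat(s' | s, a) = N(s, a, s') / N(s, a). Counts are taken along the
    chain induced by the logging policy, so samples may be dependent."""
    counts = defaultdict(int)   # N(s, a, s')
    totals = defaultdict(int)   # N(s, a)
    for t in range(len(trajectory) - 1):
        (s, a), (s_next, _) = trajectory[t], trajectory[t + 1]
        counts[(s, a, s_next)] += 1
        totals[(s, a)] += 1
    return {key: c / totals[key[:2]] for key, c in counts.items()}
```

Only observed triples appear in the returned dictionary; how this simple ratio behaves under weak mixing of the logging policy is exactly the sample-complexity question the study addresses.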
- Award ID(s):
- 2143752
- PAR ID:
- 10613259
- Publisher / Repository:
- INFORMS
- Date Published:
- Journal Name:
- Operations Research
- ISSN:
- 0030-364X
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
We consider a family of Markov chains whose transition dynamics are affected by model parameters. Understanding the parametric dependence of (complex) performance measures of such Markov chains is often of significant interest. The derivatives of the performance measures with respect to the parameters, and their continuity, play important roles, for example, in numerical optimization of the performance measures and in quantifying the uncertainty that parameter estimation errors induce in the performance measures. In this paper, we establish conditions that guarantee the smoothness of various types of intractable performance measures, such as the stationary and random-horizon discounted performance measures, of general state space Markov chains, and we provide probabilistic representations for the derivatives. Funding: C.-H. Rhee is supported by the National Science Foundation [Grant CMMI-2146530].
-
We propose a nonconvex estimator for the covariate-adjusted precision matrix estimation problem in the high-dimensional regime, under sparsity constraints. To compute this estimator, we propose an alternating gradient descent algorithm with hard thresholding. Compared with existing methods along this line of research, which lack theoretical guarantees on the optimization error and/or the statistical error, the proposed algorithm is not only computationally much more efficient, with a linear rate of convergence, but also attains the optimal statistical rate up to a logarithmic factor. Thorough experiments on both synthetic and real data support our theory.
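The core mechanism, gradient descent combined with hard thresholding, can be illustrated on the plain (unadjusted) Gaussian precision-matrix objective. This is a simplified sketch under our own assumptions, not the paper's alternating, covariate-adjusted algorithm; the step size and sparsity level are illustrative.

```python
import numpy as np

def hard_threshold(M, s):
    """Keep the s largest-magnitude entries of M; zero out the rest.
    (Ties at the threshold may keep a few extra entries.)"""
    flat = np.abs(M).ravel()
    if s >= flat.size:
        return M.copy()
    tau = np.partition(flat, -s)[-s]   # s-th largest magnitude
    return np.where(np.abs(M) >= tau, M, 0.0)

def sparse_precision_sketch(S_cov, s, step=0.05, iters=200):
    """Gradient descent with hard thresholding for the objective
    tr(S_cov @ Theta) - logdet(Theta), keeping at most ~s nonzeros.
    A toy sketch only: no covariate adjustment, no alternating blocks."""
    p = S_cov.shape[0]
    Theta = np.eye(p)
    for _ in range(iters):
        grad = S_cov - np.linalg.inv(Theta)        # gradient of the objective
        Theta = hard_threshold(Theta - step * grad, s)
        Theta = (Theta + Theta.T) / 2.0            # keep the iterate symmetric
    return Theta
```

Each iteration costs one gradient step plus a projection onto the sparsity constraint, which is where the linear convergence rate claimed above comes from in this style of analysis.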
-
Key features of biological activity can often be captured by transitions between a finite number of semi-stable states that correspond to behaviors or decisions. We present here a broad class of dynamical systems that are ideal for modeling such activity. The models we propose are chaotic heteroclinic networks with nontrivial intersections of stable and unstable manifolds. Due to the sensitive dependence on initial conditions, transitions between states are seemingly random. Dwell times, exit distributions, and other transition statistics can be built into the model through geometric design and can be controlled by tunable parameters. To test our model's ability to simulate realistic biological phenomena, we turned to one of the most studied organisms, C. elegans, well known for its limited behavioral states. We reconstructed experimental data from two laboratories, demonstrating the model's ability to quantitatively reproduce dwell times and transition statistics under a variety of conditions. Stochastic switching between dominant states in complex dynamical systems has been extensively studied and is often modeled with Markov chains. As an alternative, we propose here a new paradigm: chaotic heteroclinic networks generated by deterministic rules, without the necessity for noise. Chaotic heteroclinic networks can be used to model systems with arbitrary architecture and size without a commensurate increase in phase-space dimension. They are highly flexible and able to capture a wide range of transition characteristics that can be adjusted through control parameters.
-
To tackle massive data, subsampling is a practical approach to select the more informative data points. However, when responses are expensive to measure, developing efficient subsampling schemes is challenging, and an optimal sampling approach under measurement constraints was developed to meet this challenge. This method uses the inverses of optimal sampling probabilities to reweight the objective function, which assigns smaller weights to the more important data points. Thus, the estimation efficiency of the resulting estimator can be improved. In this paper, we propose an unweighted estimating procedure based on optimal subsamples to obtain a more efficient estimator. We obtain the unconditional asymptotic distribution of the estimator via martingale techniques without conditioning on the pilot estimate, which has been less investigated in the existing subsampling literature. Both asymptotic results and numerical results show that the unweighted estimator is more efficient in parameter estimation.
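The contrast between the two estimators can be sketched for linear least squares: the classical approach reweights the subsample by inverse sampling probabilities, while the proposed approach fits the subsample unweighted. This toy version omits the pilot estimate and any bias correction from the actual procedure; all names are illustrative.

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares: argmin_b sum_i w_i * (y_i - x_i' b)^2,
    solved via the weighted normal equations."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

def weighted_subsample_fit(X, y, idx, probs):
    """Classical estimator: reweight each subsampled point by the
    inverse of its sampling probability."""
    return wls(X[idx], y[idx], 1.0 / probs[idx])

def unweighted_subsample_fit(X, y, idx):
    """Unweighted fit on the same subsample -- the style of estimator
    the paper argues is more efficient (here without its corrections)."""
    return wls(X[idx], y[idx], np.ones(len(idx)))
```

On noiseless data both fits recover the true coefficients; the efficiency gap the paper quantifies appears once noise and non-uniform, response-dependent sampling probabilities enter.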