skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Search for: All records

Creators/Authors contains: "Blanchet, Jose"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Free, publicly-accessible full text available December 1, 2025
  2. Salakhutdinov, Ruslan; Kolter, Zico; Heller, Katherine; Weller, Adrian; Oliver, Nuria; Scarlett, Jonathan; Berkenkamp, Felix (Ed.)
    To mitigate the limitation that the classical reinforcement learning (RL) framework heavily relies on identical training and test environments, Distributionally Robust RL (DRRL) has been proposed to enhance performance across a range of environments, possibly including unknown test environments. As a price for robustness gain, DRRL involves optimizing over a set of distributions, which is inherently more challenging than optimizing over a fixed distribution in the non-robust case. Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory. In this paper, we design a first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ). We delicately design a multi-timescale framework to fully utilize each incrementally arriving sample and directly learn the optimal distributionally robust policy without modeling the environment, thus the algorithm can be trained along a single trajectory in a model-free fashion. Despite the algorithm’s complexity, we provide asymptotic convergence guarantees by generalizing classical stochastic approximation tools. Comprehensive experimental results demonstrate the superior robustness and sample complexity of our proposed algorithm, compared to non-robust methods and other robust RL algorithms. 
    more » « less
  3. Dasgupta, Sanjoy; Mandt, Stephan; Li, Yingzhen (Ed.)
    Average reward reinforcement learning (RL) provides a suitable framework for capturing the objective (i.e. long-run average reward) for continuing tasks, where there is often no natural way to identify a discount factor. However, existing average reward RL algorithms with sample complexity guarantees are not feasible, as they take as input the (unknown) mixing time of the Markov decision process (MDP). In this paper, we make initial progress towards addressing this open problem. We design a feasible average-reward $$Q$$-learning framework that requires no knowledge of any problem parameter as input. Our framework is based on discounted $$Q$$-learning, while we dynamically adapt the discount factor (and hence the effective horizon) to progressively approximate the average reward. In the synchronous setting, we solve three tasks: (i) learn a policy that is $$\epsilon$$-close to optimal, (ii) estimate optimal average reward with $$\epsilon$$-accuracy, and (iii) estimate the bias function (similar to $$Q$$-function in discounted case) with $$\epsilon$$-accuracy. We show that with carefully designed adaptation schemes, (i) can be achieved with $$\tilde{O}(\frac{SA t_{\mathrm{mix}}^{8}}{\epsilon^{8}})$$ samples, (ii) with $$\tilde{O}(\frac{SA t_{\mathrm{mix}}^5}{\epsilon^5})$$ samples, and (iii) with $$\tilde{O}(\frac{SA B}{\epsilon^9})$$ samples, where $$t_\mathrm{mix}$$ is the mixing time, and $B > 0$ is an MDP-dependent constant. To our knowledge, we provide the first finite-sample guarantees that are polynomial in $$S, A, t_{\mathrm{mix}}, \epsilon$$ for a feasible variant of $$Q$$-learning. That said, the sample complexity bounds have tremendous room for improvement, which we leave for the community’s best minds. Preliminary simulations verify that our framework is effective without prior knowledge of parameters as input. 
    more » « less
  4. Dasgupta, Sanjoy; Mandt, Stephan; Li, Yingzhen (Ed.)
    Average reward reinforcement learning (RL) provides a suitable framework for capturing the objective (i.e. long-run average reward) for continuing tasks, where there is often no natural way to identify a discount factor. However, existing average reward RL algorithms with sample complexity guarantees are not feasible, as they take as input the (unknown) mixing time of the Markov decision process (MDP). In this paper, we make initial progress towards addressing this open problem. We design a feasible average-reward $$Q$$-learning framework that requires no knowledge of any problem parameter as input. Our framework is based on discounted $$Q$$-learning, while we dynamically adapt the discount factor (and hence the effective horizon) to progressively approximate the average reward. In the synchronous setting, we solve three tasks: (i) learn a policy that is $$\epsilon$$-close to optimal, (ii) estimate optimal average reward with $$\epsilon$$-accuracy, and (iii) estimate the bias function (similar to $$Q$$-function in discounted case) with $$\epsilon$$-accuracy. We show that with carefully designed adaptation schemes, (i) can be achieved with $$\tilde{O}(\frac{SA t_{\mathrm{mix}}^{8}}{\epsilon^{8}})$$ samples, (ii) with $$\tilde{O}(\frac{SA t_{\mathrm{mix}}^5}{\epsilon^5})$$ samples, and (iii) with $$\tilde{O}(\frac{SA B}{\epsilon^9})$$ samples, where $$t_\mathrm{mix}$$ is the mixing time, and $B > 0$ is an MDP-dependent constant. To our knowledge, we provide the first finite-sample guarantees that are polynomial in $$S, A, t_{\mathrm{mix}}, \epsilon$$ for a feasible variant of $$Q$$-learning. That said, the sample complexity bounds have tremendous room for improvement, which we leave for the community’s best minds. Preliminary simulations verify that our framework is effective without prior knowledge of parameters as input. 
    more » « less
  5. Dasgupta, Sanjoy; Mandt, Stephan; Li, Yingzhen (Ed.)
    Average reward reinforcement learning (RL) provides a suitable framework for capturing the objective (i.e. long-run average reward) for continuing tasks, where there is often no natural way to identify a discount factor. However, existing average reward RL algorithms with sample complexity guarantees are not feasible, as they take as input the (unknown) mixing time of the Markov decision process (MDP). In this paper, we make initial progress towards addressing this open problem. We design a feasible average-reward $$Q$$-learning framework that requires no knowledge of any problem parameter as input. Our framework is based on discounted $$Q$$-learning, while we dynamically adapt the discount factor (and hence the effective horizon) to progressively approximate the average reward. In the synchronous setting, we solve three tasks: (i) learn a policy that is $$\epsilon$$-close to optimal, (ii) estimate optimal average reward with $$\epsilon$$-accuracy, and (iii) estimate the bias function (similar to $$Q$$-function in discounted case) with $$\epsilon$$-accuracy. We show that with carefully designed adaptation schemes, (i) can be achieved with $$\tilde{O}(\frac{SA t_{\mathrm{mix}}^{8}}{\epsilon^{8}})$$ samples, (ii) with $$\tilde{O}(\frac{SA t_{\mathrm{mix}}^5}{\epsilon^5})$$ samples, and (iii) with $$\tilde{O}(\frac{SA B}{\epsilon^9})$$ samples, where $$t_\mathrm{mix}$$ is the mixing time, and $B > 0$ is an MDP-dependent constant. To our knowledge, we provide the first finite-sample guarantees that are polynomial in $$S, A, t_{\mathrm{mix}}, \epsilon$$ for a feasible variant of $$Q$$-learning. That said, the sample complexity bounds have tremendous room for improvement, which we leave for the community’s best minds. Preliminary simulations verify that our framework is effective without prior knowledge of parameters as input. 
    more » « less
  6. We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of O(|S||A|t2mixϵ−2)* and a lower bound of Ω(|S||A|tmixϵ−2). In these expressions, |S| and |A| denote the cardinalities of the state and action spaces respectively, tmix serves as a uniform upper limit for the total variation mixing times, and ϵ signifies the error tolerance. Therefore, a notable gap of tmix still remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of eO(|S||A|tmixϵ−2). This marks the first algorithm and analysis to reach the literature’s lower bound. Our new algorithm draws inspiration from ideas in Li et al. (2020), Jin & Sidford (2021), and Wang et al. (2023). Additionally, we conduct numerical experiments to validate our theoretical findings. 
    more » « less
  7. We consider a generic class of chance-constrained optimization problems with heavy-tailed (i.e., power-law type) risk factors. As the most popular generic method for solving chance constrained optimization, the scenario approach generates sampled optimization problem as a precise approximation with provable reliability, but the computational complexity becomes intractable when the risk tolerance parameter is small. To reduce the complexity, we sample the risk factors from a conditional distribution given that the risk factors are in an analytically tractable event that encompasses all the plausible events of constraints violation. Our approximation is proven to have optimal value within a constant factor to the optimal value of the original chance constraint problem with high probability, uniformly in the risk tolerance parameter. To the best of our knowledge, our result is the first uniform performance guarantee of this type. We additionally demonstrate the efficiency of our algorithm in the context of solvency in portfolio optimization and insurance networks. Funding: The research of B. Zwart is supported by the NWO (Dutch Research Council) [Grant 639.033.413]. The research of J. Blanchet is supported by the Air Force Office of Scientific Research [Award FA9550-20-1-0397], the National Science Foundation [Grants 1820942, 1838576, 1915967, and 2118199], Defense Advanced Research Projects Agency [Award N660011824028], and China Merchants Bank. 
    more » « less
  8. We prove a sample-path large deviation principle (LDP) with sublinear speed for unbounded functionals of certain Markov chains induced by the Lindley recursion. The LDP holds in the Skorokhod space [Formula: see text] equipped with the [Formula: see text] topology. Our technique hinges on a suitable decomposition of the Markov chain in terms of regeneration cycles. Each regeneration cycle denotes the area accumulated during the busy period of the reflected random walk. We prove a large deviation principle for the area under the busy period of the Markov random walk, and we show that it exhibits a heavy-tailed behavior. Funding: The research of B. Zwart and M. Bazhba is supported by the Nederlandse Organisatie voor Wetenschappelijk Onderzoek [Grant 639.033.413]. The research of J. Blanchet is supported by the National Science Foundation (NSF) [Grants 1915967, 1820942, and 1838576] as well as the Defense Advanced Research Projects Agency [Grant N660011824028]. The research of C.-H. Rhee is supported by the NSF [Grant CMMI-2146530]. 
    more » « less
  9. To mitigate the limitation that the classical reinforcement learning (RL) framework heavily relies on identical training and test environments, Distributionally Robust RL (DRRL) has been proposed to enhance performance across a range of environments, possibly including unknown test environments. As a price for robustness gain, DRRL involves optimizing over a set of distributions, which is inherently more challenging than optimizing over a fixed distribution in the non-robust case. Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory. In this paper, we design a first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ). We delicately design a multi-timescale framework to fully utilize each incrementally arriving sample and directly learn the optimal distributionally robust policy without modeling the environment, thus the algorithm can be trained along a single trajectory in a model-free fashion. Despite the algorithm's complexity, we provide asymptotic convergence guarantees by generalizing classical stochastic approximation tools.Comprehensive experimental results demonstrate the superior robustness and sample complexity of our proposed algorithm, compared to non-robust methods and other robust RL algorithms. @inproceedings{ liang2024singletrajectory, title={Single-Trajectory Distributionally Robust Reinforcement Learning}, author={Zhipeng Liang and Xiaoteng Ma and Jose Blanchet and Jun Yang and Jiheng Zhang and Zhengyuan Zhou}, booktitle={Forty-first International Conference on Machine Learning}, year={2024}, url={https://openreview.net/forum?id=3B6vmW2L80} } 
    more » « less
  10. Salakhutdinov, Ruslan; Kolter, Zico; Heller, Katherine; Weller, Adrian; Oliver, Nuria; Scarlett, Jonathan; Berkenkamp, Felix (Ed.)
    To mitigate the limitation that the classical reinforcement learning (RL) framework heavily relies on identical training and test environments, Distributionally Robust RL (DRRL) has been proposed to enhance performance across a range of environments, possibly including unknown test environments. As a price for robustness gain, DRRL involves optimizing over a set of distributions, which is inherently more challenging than optimizing over a fixed distribution in the non-robust case. Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory. In this paper, we design a first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ). We delicately design a multi-timescale framework to fully utilize each incrementally arriving sample and directly learn the optimal distributionally robust policy without modeling the environment, thus the algorithm can be trained along a single trajectory in a model-free fashion. Despite the algorithm's complexity, we provide asymptotic convergence guarantees by generalizing classical stochastic approximation tools. Comprehensive experimental results demonstrate the superior robustness and sample complexity of our proposed algorithm, compared to non-robust methods and other robust RL algorithms. 
    more » « less