When the operation and maintenance (O&M) of infrastructure components is modeled as a Markov Decision Process (MDP), the stochastic evolution following the optimal policy is completely described by a Markov transition matrix. This paper illustrates how to predict relevant features of the time evolution of these controlled components. We are interested in assessing if a critical state is reachable, in assessing the probability of reaching that state within a time period, of visiting that state before another, and in returning to that state. We present analytical methods to address these questions and discuss their computational complexity. Outcomes of these analyses can provide the decision makers with deeper understanding of the component evolution and suggest revising the control policy. We formulate the framework for MDPs and extend it to Partially Observable Markov Decision Processes (POMDPs).
Predicting the Evolution of Controlled Systems Modeled by Finite Markov Processes
The operation and maintenance of infrastructure components and systems can be modeled as a Markov process, partially or fully observable. Information about the current condition can be summarized by the “inner” state of a finite state controller. When a control policy is assigned, the stochastic evolution of the system is completely described by a Markov transition function. This article applies finite state Markov chain analyses to identify relevant features of the time evolution of a controlled system. We focus on assessing if some critical conditions are reachable (or if some actions will ever be taken), in identifying the probability of these critical events occurring within a time period, their expected time of occurrence, their longterm frequency, and the probability that some events occur before others. We present analytical methods based on linear algebra to address these questions, discuss their computational complexity and the structure of the solution. The analyses can be performed after a policy is selected for a Markov decision process (MDP) or a partially observable MDP. Their outcomes depend on the selected policy and examining these outcomes can provide the decision makers with deeper understanding of the consequences of following that policy, and may also suggest revising it.
 Award ID(s):
 1663479
 Publication Date:
 NSFPAR ID:
 10298075
 Journal Name:
 IEEE Transactions on Reliability
 Page Range or eLocationID:
 1 to 19
 ISSN:
 00189529
 Sponsoring Org:
 National Science Foundation
More Like this


Green wireless networks Wakeup radio Energy harvesting Routing Markov decision process Reinforcement learning 1. Introduction With 14.2 billions of connected things in 2019, over 41.6 billions expected by 2025, and a total spending on endpoints and services that will reach well over $1.1 trillion by the end of 2026, the Internet of Things (IoT) is poised to have a transformative impact on the way we live and on the way we work [1–3]. The vision of this ‘‘connected continuum’’ of objects and people, however, comes with a wide variety of challenges, especially for those IoT networks whose devices rely on some forms of depletable energy support. This has prompted research on hardware and software solutions aimed at decreasing the depen dence of devices from ‘‘prepackaged’’ energy provision (e.g., batteries), leading to devices capable of harvesting energy from the environment, and to networks – often called green wireless networks – whose lifetime is virtually infinite. Despite the promising advances of energy harvesting technologies, IoT devices are still doomed to run out of energy due to their inherent constraints on resources such as storage, processing and communica tion, whose energy requirements often exceed what harvesting can provide. The communication circuitry of prevailingmore »

This paper presents a framework to learn the reward function underlying highlevel sequential tasks from demonstrations. The purpose of reward learning, in the context of learning from demonstration (LfD), is to generate policies that mimic the demonstrator’s policies, thereby enabling imitation learning. We focus on a humanrobot interaction(HRI) domain where the goal is to learn and model structured interactions between a human and a robot. Such interactions can be modeled as a partially observable Markov decision process (POMDP) where the partial observability is caused by uncertainties associated with the ways humans respond to different stimuli. The key challenge in finding a good policy in such a POMDP is determining the reward function that was observed by the demonstrator. Existing inverse reinforcement learning(IRL) methods for POMDPs are computationally very expensive and the problem is not well understood. In comparison, IRL algorithms for Markov decision process (MDP) are well defined and computationally efficient. We propose an approach of reward function learning for highlevel sequential tasks from human demonstrations where the core idea is to reduce the underlying POMDP to an MDP and apply any efficient MDPIRL algorithm. Our extensive experiments suggest that the reward function learned this way generates POMDP policies thatmore »

Motivated by the many realworld applications of reinforcement learning (RL) that require safepolicy iterations, we consider the problem of offpolicy evaluation (OPE) — the problem of evaluating a new policy using the historical data ob tained by different behavior policies — under the model of nonstationary episodic Markov Decision Processes (MDP) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon H. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a meansquared error of [ ] where μ and π are the logging and target policies, dμt (st) and dπt (st) are the marginal distribution of the state at tth step, H is the horizon, n is the sample size and V π is the value function of the MDP under π. The result matches the t+1 CramerRao lower bound in Jiang and Li [2016] up to a multiplicative factor of H. To the best of our knowledge, this is the first OPE estimation error bound with a polynomial dependence on H . Besidesmore »

We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a rewardmaximizing policy in an unknown \emph{Markov Decision Process} (MDP) using the data coming from a policy $\mu$. In particular, we consider the sample complexity problems of offline RL for the finite horizon MDPs. Prior works derive the informationtheoretical lower bounds based on different datacoverage assumptions and their upper bounds are expressed by the covering coefficients which lack the explicit characterization of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive the suboptimality upper bound that nearly matches $ O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}{(V^\star_{h+1}+r_h)}}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right). $ We also prove an informationtheoretical lower bound to show this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is a optimal policy, $\mu$ is the behavior policy and $d(s_h,a_h)$ is the marginal stateaction probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound} since it directly implies all the existing optimal results: minimax rate under uniform datacoverage assumption, horizonfree setting, single policy concentrability, and the tight problemdependent results. Later, we extend the result to the \emph{assumptionfree} regime (where we make no assumption on $ \mu$) and obtain the assumptionfree intrinsicmore »