The practicality of reinforcement learning algorithms has been limited due to poor scaling with respect to the problem size, as the sample complexity of learning an ε-optimal policy is Ω(SAH/ε²) over worst-case instances of an MDP with state space S, action space A, and horizon H. We consider a class of MDPs for which the associated optimal Q* function is low rank, where the latent features are unknown. While one would hope to achieve linear sample complexity in S and A due to the low-rank structure, we show that without imposing further assumptions beyond low rank of Q*, if one is constrained to estimate the Q function using only observations from a subset of entries, there is a worst-case instance in which one must incur a sample complexity exponential in the horizon H to learn a near-optimal policy. We subsequently show that under stronger low-rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of Õ((S+A)·poly(d,H)/ε²) for a rank-d setting, which is minimax optimal with respect to the scaling of S, A, and ε. In contrast to the literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal low-rank structural assumptions required on the MDP with respect to the transition kernel versus the optimal action-value function.
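As a rough illustration of how low-rank structure can be exploited (this is only a sketch, not the paper's exact procedure), the snippet below runs backward induction and, at each step, replaces noisy Q estimates with their best rank-d approximation before acting greedily. The truncated SVD here stands in for the matrix-estimation step of LR-MCPI/LR-EVI, which needs estimates on only a subset of entries; `sample_q`, `S`, `A`, `H`, and `d` are hypothetical placeholders.

```python
import numpy as np

def low_rank_empirical_vi(sample_q, S, A, H, d):
    """Backward induction with a rank-d projection at each step (illustrative).

    sample_q(h, V_next) should return an (S, A) array of noisy estimates of
    r_h(s, a) + E[V_next(s')], e.g. obtained from a generative model.
    """
    V_next = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q_hat = sample_q(h, V_next)             # noisy estimate of Q_h
        U, sig, Vt = np.linalg.svd(Q_hat, full_matrices=False)
        Q_hat = (U[:, :d] * sig[:d]) @ Vt[:d]   # project onto the top-d factors
        policy[h] = Q_hat.argmax(axis=1)        # greedy action at step h
        V_next = Q_hat.max(axis=1)              # value passed to the next backup
    return policy
```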
Optimal scheduling of multiple sensors which transmit measurements over a dynamic lossy network
Motivated by various distributed control applications, we consider a linear system with Gaussian noise observed by multiple sensors which transmit measurements over a dynamic lossy network. We characterize the stationary optimal sensor scheduling policy for the finite-horizon, discounted, and long-term average cost problems and show that the value iteration algorithm converges to a solution of the average cost problem. We further show that the suboptimal policies provided by the rolling-horizon truncation of the value iteration also guarantee geometric ergodicity and provide near-optimal average cost. Lastly, we provide qualitative characterizations of the multidimensional set of measurement loss rates for which the system is stabilizable for a static network, significantly extending earlier results on intermittent observations.
 Award ID(s):
 1715210
 NSF-PAR ID:
 10139611
 Date Published:
 Journal Name:
 2019 IEEE 58th Conference on Decision and Control (CDC)
 Page Range / eLocation ID:
 684 to 689
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We consider an ultra-dense wireless network with N channels and M = N devices. Messages with fresh information are generated at each device according to a random process and need to be transmitted to an access point. The value of a message decreases as it ages, so each device searches for an idle channel to transmit the message as soon as it can. However, each channel probing is associated with a fixed cost (energy), so a device needs to adapt its probing rate based on the "age" of the message. At each device, the design of the optimal probing strategy can be formulated as an infinite-horizon Markov Decision Process (MDP) where the devices compete with each other to find idle channels. While it is natural to view the system as a Bayesian game, it is often intractable to analyze such a system. Thus, we use the Mean Field Game (MFG) approach to analyze the system in a large-system regime, where the number of devices is very large, to understand the structure of the problem and to find efficient probing strategies. We present an analysis based on the MFG perspective. We begin by characterizing the space of valid policies and use this to show the existence of a Mean Field Nash Equilibrium (MFNE) in a constrained set for any general increasing cost functions with diminishing rewards. Further, we provide an algorithm for computing the equilibrium for any given device, and the corresponding age-dependent channel probing policy.
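For intuition about the single-device problem each agent solves against the mean field, here is a toy value iteration over message age. All quantities (the truncated age grid, the decay of a message's value, the probing cost, and the idle-channel probability that the equilibrium would pin down) are hypothetical placeholders, and discounting is used only to make the toy problem well posed; this is not the formulation analyzed in the paper.

```python
import numpy as np

A_MAX, GAMMA = 50, 0.95                    # truncated age grid, toy discount
PROBE_COST, P_IDLE = 0.2, 0.3              # hypothetical probing cost / idle probability
reward = lambda age: np.exp(-0.1 * age)    # value of a message decays with its age

V = np.zeros(A_MAX + 1)                    # V[age]: value of holding an undelivered message
for _ in range(500):                       # value iteration toward the fixed point
    V_new = np.empty_like(V)
    for age in range(A_MAX + 1):
        nxt = min(age + 1, A_MAX)
        wait = GAMMA * V[nxt]              # skip probing this slot; the message ages
        probe = (-PROBE_COST                               # pay the probing cost
                 + P_IDLE * reward(age)                    # idle channel: deliver now (value 0 afterward)
                 + (1 - P_IDLE) * GAMMA * V[nxt])          # busy: retry next slot with an older message
        V_new[age] = max(wait, probe)
    V = V_new

# Greedy age-dependent policy: probe when the one-step value of probing beats waiting.
probe_now = [(-PROBE_COST + P_IDLE * reward(a) + (1 - P_IDLE) * GAMMA * V[min(a + 1, A_MAX)])
             >= GAMMA * V[min(a + 1, A_MAX)]
             for a in range(A_MAX + 1)]
```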

We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown \emph{Markov Decision Process} (MDP) using data collected by a behavior policy $\mu$. In particular, we consider the sample complexity of offline RL for finite-horizon MDPs. Prior works derive information-theoretic lower bounds based on different data-coverage assumptions, and their upper bounds are expressed via covering coefficients that lack an explicit characterization in terms of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive a suboptimality upper bound that nearly matches $ O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}{(V^\star_{h+1}+r_h)}}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right). $ We also prove an information-theoretic lower bound showing that this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ whenever $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is an optimal policy, $\mu$ is the behavior policy, and $d(s_h,a_h)$ is the marginal state-action probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound}, since it directly implies all the existing optimal results: the minimax rate under the uniform data-coverage assumption, the horizon-free setting, single-policy concentrability, and the tight problem-dependent results. Later, we extend the result to the \emph{assumption-free} regime (where we make no assumption on $\mu$) and obtain the assumption-free intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.
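As a rough illustration of the pessimism principle behind APVI (not the algorithm itself, whose penalty is variance-aware rather than the simple $1/\sqrt{n}$ bonus used here), the sketch below runs tabular value iteration on offline counts and subtracts a penalty that shrinks with the number of visits to each state-action pair; `counts`, `rewards`, and the constant `c` are hypothetical inputs.

```python
import numpy as np

def pessimistic_value_iteration(counts, rewards, S, A, H, c=1.0):
    """Tabular value iteration on offline data with a simple pessimism penalty.

    counts[h]: (S, A, S) array of observed transition counts at step h.
    rewards[h]: (S, A) array of empirical mean rewards at step h.
    The penalty c / sqrt(n(s, a)) is a stand-in for the variance-aware
    penalty behind APVI's instance-dependent guarantee.
    """
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        n = counts[h].sum(axis=2)                        # visits to each (s, a)
        P_hat = counts[h] / np.maximum(n, 1)[..., None]  # empirical transition kernel
        Q = rewards[h] + P_hat @ V[h + 1] - c / np.sqrt(np.maximum(n, 1))
        Q[n == 0] = 0.0                                  # unseen pairs stay at the floor
        pi[h] = Q.argmax(axis=1)                         # pessimistically greedy policy
        V[h] = np.clip(Q.max(axis=1), 0.0, H - h)        # keep values in the valid range
    return pi, V
```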

Virtual Reality (VR), together with the network infrastructure, can provide an interactive and immersive experience for multiple users simultaneously and thus enables collaborative VR applications (e.g., a VR-based classroom). However, a satisfactory user experience requires not only high-resolution panoramic image rendering but also extremely low latency and seamless interaction. Besides, the competition for limited network resources (e.g., multiple users sharing the total limited bandwidth) poses a significant challenge to the collaborative user experience, in particular under wireless networks with time-varying capacities. While existing works have tackled some of these challenges, a principled design considering all those factors is still missing. In this paper, we formulate a combinatorial optimization problem to maximize the Quality of Experience (QoE), defined as a linear combination of the quality, the average VR content delivery delay, and the variance of the quality over a finite time horizon. In particular, we incorporate the influence of imperfect motion prediction when considering the quality of the perceived contents. However, the optimal solution to this problem cannot be implemented in real time since it relies on future decisions. We therefore decompose the optimization problem into a series of combinatorial optimizations, one per time slot, and develop a low-complexity algorithm that is guaranteed to achieve at least 1/2 of the optimal value. Despite this worst-case guarantee, trace-based simulation results reveal that our algorithm performs very close to the optimal offline solution. Furthermore, we implement our proposed algorithm in a practical system with commercial mobile devices and demonstrate its superior performance over state-of-the-art algorithms. We open-source our implementations at https://github.com/SNeCLabPSU/ICDCSCollaborativeVR.

We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision processes (MDPs), which are more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of model-free RL algorithms is relatively inadequate in the average-reward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCB-AVG, based on an optimistic variant of variance-reduced Q-learning. We show that UCB-AVG achieves a regret bound $\widetilde{O}(S^5A^2\,sp(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of the state-action space and $sp(h^*)$ is the span of the optimal bias function. Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence on $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results either are suboptimal in $T$ or require strong assumptions of ergodicity or uniform mixing of MDPs. In the simulator setting, we adapt the idea of UCB-AVG to develop a model-free algorithm that finds an $\epsilon$-optimal policy with sample complexity $\widetilde{O}(SA\,sp^2(h^*)\epsilon^{-2} + S^2A\,sp(h^*)\epsilon^{-1})$. This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SA\,sp(h^*)\epsilon^{-2})$. Existing work mainly focuses on ergodic MDPs, and the results typically depend on $t_{mix}$, the worst-case mixing time induced by a policy. We remark that the diameter $D$ and mixing time $t_{mix}$ are both lower bounded by $sp(h^*)$, and $t_{mix}$ can be arbitrarily large for certain MDPs. On the technical side, our approach integrates two key ideas: learning a $\gamma$-discounted MDP as an approximation, and leveraging a reference-advantage decomposition for variance reduction in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of the value difference between state pairs to provide a sharper bound on the variance of the reference advantage. We also crucially use a careful choice of the discount factor $\gamma$ to balance the approximation error due to discounting against the statistical learning error, and we are able to maintain a good-quality reference value function with $O(SA)$ space complexity.
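To illustrate the discounted-proxy idea in the simulator setting (in a much simplified form: the kernel and rewards are assumed exactly known, and the particular choice of $\gamma$ below is a hypothetical stand-in for the careful tuning analyzed in the paper), one can solve a $\gamma$-discounted MDP with $\gamma$ pushed toward 1 at a rate governed by the span bound and the target accuracy, then act greedily. `P`, `r`, `span_bound`, and `eps` are hypothetical inputs.

```python
import numpy as np

def discounted_proxy_policy(P, r, span_bound, eps):
    """Solve a discounted proxy for an average-reward MDP (illustrative).

    P: (S, A, S) transition kernel, r: (S, A) mean rewards, both assumed
    known exactly here for simplicity.  The discount factor is pushed toward 1
    as the target accuracy eps shrinks, so that a near-optimal policy of the
    discounted proxy is also near-optimal under the average-reward criterion.
    """
    gamma = max(0.5, 1.0 - eps / (2.0 * span_bound))       # hypothetical tuning
    V = np.zeros(P.shape[0])
    for _ in range(int(np.ceil(10.0 / (1.0 - gamma)))):    # plain value iteration
        Q = r + gamma * (P @ V)                            # (S, A) Bellman backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1)                                # greedy policy for the proxy MDP
```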