In many real-world reinforcement learning (RL) problems, in addition to maximizing the objective, the learning agent has to maintain some necessary safety constraints. We formulate the problem of learning a safe policy as an infinite-horizon discounted Constrained Markov Decision Process (CMDP) with an unknown transition probability matrix, where the safety requirements are modeled as constraints on expected cumulative costs. We propose two model-based constrained reinforcement learning (CRL) algorithms for learning a safe policy: (i) the GM-CRL algorithm, which has access to a generative model, and (ii) the UC-CRL algorithm, which learns the model using an upper-confidence-style online exploration method. We characterize the sample complexity of these algorithms, i.e., the number of samples needed to ensure a desired level of accuracy with high probability, with respect to both objective maximization and constraint satisfaction.
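
The generative-model setting lends itself to a compact tabular sketch: draw a fixed number of next-state samples per state-action pair, then solve the estimated CMDP as a linear program over occupancy measures. The following is a minimal illustration of that pipeline under stated assumptions, not the paper's GM-CRL algorithm; the oracle `sample_transition`, the solver choice, and all problem sizes are invented for the example.

```python
# Minimal sketch of a generative-model CRL pipeline (illustrative, not GM-CRL itself).
import numpy as np
from scipy.optimize import linprog

def estimate_model(sample_transition, nS, nA, n_samples):
    """Estimate P(s'|s,a) by drawing n_samples next states per (s,a)
    from an assumed generative-model oracle `sample_transition`."""
    P_hat = np.zeros((nS, nA, nS))
    for s in range(nS):
        for a in range(nA):
            for _ in range(n_samples):
                P_hat[s, a, sample_transition(s, a)] += 1.0
    return P_hat / n_samples

def solve_cmdp_lp(P, r, c, mu0, gamma, budget):
    """Solve the CMDP as the standard occupancy-measure LP: maximize
    expected discounted reward subject to Bellman-flow constraints and
    a single expected-discounted-cost constraint."""
    nS, nA = r.shape
    n = nS * nA  # variable x[s*nA + a] = normalized occupancy of (s, a)
    A_eq = np.zeros((nS, n))
    for s in range(nS):
        for a in range(nA):
            j = s * nA + a
            A_eq[s, j] += 1.0                 # outflow from state s
            A_eq[:, j] -= gamma * P[s, a, :]  # discounted inflow to each s'
    b_eq = (1.0 - gamma) * mu0
    # Safety constraint: sum_x x(s,a) c(s,a) <= (1 - gamma) * budget.
    A_ub = c.reshape(1, n)
    b_ub = np.array([(1.0 - gamma) * budget])
    res = linprog(-r.reshape(n), A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    assert res.success, res.message  # infeasible if the budget is too tight
    x = res.x.reshape(nS, nA)
    # Recover a (possibly stochastic) safe policy from the occupancy measure.
    return x / np.maximum(x.sum(axis=1, keepdims=True), 1e-12)
```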

We consider an ultra-dense wireless network with N channels and M = N devices. Messages with fresh information are generated at each device according to a random process and need to be transmitted to an access point. The value of a message decreases as it ages, so each device searches for an idle channel to transmit the message as soon as it can. However, each channel probing is associated with a fixed cost (energy), so a device needs to adapt its probing rate based on the "age" of the message. At each device, the design of the optimal probing strategy can be formulated as an infinite-horizon Markov Decision Process (MDP) in which the devices compete with each other to find idle channels. While it is natural to view the system as a Bayesian game, it is often intractable to analyze such a system. Thus, we use the Mean Field Game (MFG) approach to analyze the system in a large-system regime, where the number of devices is very large, in order to understand the structure of the problem and to find efficient probing strategies. We present an analysis based on the MFG perspective. We begin by characterizing the space of valid policies and …
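
In the mean-field limit, each device faces a single-agent MDP in which the probability q that a probed channel is idle is induced by the rest of the population. A minimal value-iteration sketch of that best-response computation follows; the state is the message age, the action is probe vs. wait, and q, the age-decayed message value, and the probing cost are all illustrative assumptions rather than the paper's model.

```python
# Sketch: a device's best response given a mean-field idle probability q.
import numpy as np

def probing_policy(q=0.3, c_probe=0.1, decay=0.9, max_age=50,
                   gamma=0.99, iters=2000):
    """Value iteration over message age. A probe costs c_probe and finds
    an idle channel w.p. q (the mean-field quantity); success delivers
    the age-discounted message value decay**age and ends the episode."""
    ages = np.arange(max_age + 1)
    nxt = np.minimum(ages + 1, max_age)  # message ages by one per slot
    V = np.zeros(max_age + 1)
    for _ in range(iters):
        wait = gamma * V[nxt]
        probe = (-c_probe
                 + q * decay ** ages          # idle channel found: deliver
                 + (1 - q) * gamma * V[nxt])  # busy: message ages, retry
        V = np.maximum(wait, probe)
    # Greedy policy w.r.t. the converged values: 1 = probe at this age.
    wait = gamma * V[nxt]
    probe = -c_probe + q * decay ** ages + (1 - q) * gamma * V[nxt]
    return (probe >= wait).astype(int)
```

At an equilibrium of the full MFG, q itself would be consistent with the probing policy the population adopts; this sketch only shows the inner best-response step for a fixed q.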

Many physical systems have underlying safety considerations that require that the employed policy ensures the satisfaction of a set of constraints. The analytical formulation usually takes the form of a Constrained Markov Decision Process (CMDP). We focus on the case where the CMDP is unknown, and RL algorithms obtain samples to discover the model and compute an optimal constrained policy. Our goal is to characterize the relationship between safety constraints and the number of samples needed to ensure a desired level of accuracy, in both objective maximization and constraint satisfaction, in a PAC sense. We explore two classes of RL algorithms, namely, (i) a generative-model-based approach, wherein samples are taken initially to estimate a model, and (ii) an online approach, wherein the model is updated as samples are obtained. Our main finding is that, compared to the best known bounds of the unconstrained regime, the sample complexity of constrained RL algorithms is increased by a factor that is logarithmic in the number of constraints, which suggests that the approach may be easily utilized in real systems.
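
A logarithmic dependence on the number of constraints is the signature of a union-bound step: each of the K cost estimates is made epsilon-accurate with failure probability delta/K. The following sketch shows that standard step in generic notation (K, delta, epsilon are not necessarily the paper's symbols, and this is not a claim about its actual proof):

```latex
% Hoeffding, for each fixed constraint k, with n i.i.d. samples of a
% bounded empirical cost estimate \hat{C}_k:
\Pr\bigl( |\hat{C}_k - C_k| > \epsilon \bigr) \le 2 e^{-2 n \epsilon^2}.
% A union bound over the K constraints keeps the total failure
% probability below \delta provided each term is at most \delta / K,
% which holds once
n \;\ge\; \frac{1}{2\epsilon^2} \log \frac{2K}{\delta},
% i.e., the sample requirement grows only logarithmically in K.
```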

Mean Field Games (MFGs) are a class of games with a very large number of agents, for which the standard equilibrium concept is the Mean Field Equilibrium (MFE). Algorithms for learning MFE in dynamic MFGs are unknown in general. Our focus is on an important subclass that possesses a monotonicity property called Strategic Complementarities (MFG-SC). We introduce a natural refinement of the equilibrium concept that we call Trembling-Hand-Perfect MFE (T-MFE), which allows agents to employ a measure of randomization while accounting for the impact of such randomization on their payoffs. We propose a simple algorithm for computing T-MFE under a known model. We also introduce a model-free and a model-based approach to learning T-MFE and provide sample complexities for both algorithms. We also develop a fully online learning scheme that obviates the need for a simulator. Finally, we empirically evaluate the performance of the proposed algorithms via examples motivated by real-world applications.
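
A toy illustration of the trembling-hand idea in a static game with strategic complementarities: each agent's incentive to act grows with the mass of agents acting, every agent trembles with probability eps, and monotonicity lets iterated best response converge from an extremal starting point. All payoff parameters are invented for the example, and the paper's T-MFE lives in a dynamic MFG, which this sketch does not attempt to reproduce.

```python
# Toy static game with strategic complementarities and trembling hands.
def tremble_br(m, eps=0.05, b0=-0.2, b1=1.0, cost=0.3):
    """Intended best response: play 1 iff its payoff b0 + b1*m - cost
    beats the payoff 0 of playing 0 (the incentive rises with the mass m
    of the population playing 1). Each agent then trembles: it plays its
    intended action w.p. 1-eps and a uniform random action w.p. eps."""
    a = 1.0 if b0 + b1 * m - cost > 0 else 0.0
    return (1 - eps) * a + eps * 0.5  # realized mass playing 1

def least_trembling_equilibrium(eps=0.05, iters=100):
    """Start from the smallest population state m = 0; under strategic
    complementarities the best-response map is monotone, so the iterates
    move upward to the least trembling equilibrium."""
    m = 0.0
    for _ in range(iters):
        m = tremble_br(m, eps)
    return m
```

With these illustrative parameters the game has two trembling equilibria (start the iteration from m = 1.0 to reach the greatest one), which is exactly the multiplicity that makes monotone iteration from extremal points useful.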

The ability of a P2P network to scale its throughput up in proportion to the arrival rate of peers has recently been shown to depend crucially on the chunk-sharing policy employed. Some policies can result in low frequencies of a particular chunk, known as the missing-chunk syndrome, which can dramatically reduce throughput and lead to instability of the system. For instance, commonly used policies that nominally "boost" the sharing of infrequent chunks, such as the well-known rarest-first algorithm, have been shown to be unstable. We take a complementary viewpoint and instead consider a policy that simply prevents the sharing of the most frequent chunk(s), which we call mode suppression. We also consider a more general version that suppresses the mode only if the mode frequency is larger than the lowest frequency by a fixed threshold. We prove the stability of mode suppression using Lyapunov techniques, and use a Kingman bound argument to show that the total download time does not increase with the peer arrival rate. We then design versions of mode suppression that sample a small number of peers at each time, and construct noisy mode estimates by aggregating these samples over time. We show numerically that mode suppression stabilizes …
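
The rule itself, as described above, is simple to state in code: withhold the most frequent chunk(s) from sharing, optionally only when the mode leads the rarest chunk by more than a threshold. A minimal sketch follows; the function name and the frequency-count input are illustrative, and a real client would work from sampled, noisy frequency estimates rather than exact counts.

```python
# Sketch of the (threshold) mode-suppression chunk-sharing rule.
from collections import Counter

def shareable_chunks(chunk_counts: Counter, threshold: int = 0):
    """Return the chunk ids a peer is allowed to share.

    chunk_counts maps chunk id -> number of peers currently holding it.
    threshold=0 gives plain mode suppression; threshold>0 suppresses the
    mode only if it exceeds the rarest chunk's frequency by > threshold.
    """
    if not chunk_counts:
        return []
    mode_freq = max(chunk_counts.values())
    min_freq = min(chunk_counts.values())
    if mode_freq - min_freq <= threshold:
        return list(chunk_counts)  # frequencies balanced: no suppression
    return [c for c, n in chunk_counts.items() if n < mode_freq]

# Example: chunk 2 is the mode (9 > 3 + 2), so it is withheld.
print(shareable_chunks(Counter({0: 3, 1: 4, 2: 9}), threshold=2))  # [0, 1]
```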