Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

We study the problem of reinforcement learning for a task encoded by a reward machine. The task is defined over a set of properties in the environment, called atomic propositions, and represented by Boolean variables. One unrealistic assumption commonly used in the literature is that the truth values of these propositions are accurately known. In real situations, however, these truth values are uncertain since they come from sensors that suffer from imperfections. At the same time, reward machines can be difficult to model explicitly, especially when they encode complicated tasks. We develop a reinforcementlearning algorithm that infers a reward machine that encodes the underlying task while learning how to execute it, despite the uncertainties of the propositions’ truth values. In order to address such uncertainties, the algorithm maintains a probabilistic estimate about the truth value of the atomic propositions; it updates this estimate according to new sensory measurements that arrive from exploration of the environment. Additionally, the algorithm maintains a hypothesis reward machine, which acts as an estimate of the reward machine that encodes the task to be learned. As the agent explores the environment, the algorithm updates the hypothesis reward machine according to the obtained rewards and the estimate of the atomic propositions’ truth value. Finally, the algorithm uses a Qlearning procedure for the states of the hypothesis reward machine to determine an optimal policy that accomplishes the task. We prove that the algorithm successfully infers the reward machine and asymptotically learns a policy that accomplishes the respective task.more » « lessFree, publiclyaccessible full text available August 1, 2025

In a Stackelberg game, a leader commits to a randomized strategy and a follower chooses their best strategy in response. We consider an extension of a standard Stackelberg game, called a discretetime dynamic Stackelberg game, that has an underlying state space that affects the leader’s rewards and available strategies and evolves in a Markovian manner depending on both the leader and follower’s selected trategies. Although standard Stackelberg games have been utilized to improve scheduling in security domains, their deployment is often limited by requiring complete information of the follower’s utility function. In contrast, we consider scenarios where the follower’s utility function is unknown to the leader; however, it can be linearly parameterized. Our objective is then to provide an algorithm that prescribes a randomized strategy to the leader at each step of the game based on observations of how the follower responded in previous steps. We design an online learning algorithm that, with high probability, is noregret, i.e., achieves a regret bound (when compared to the best policy in hindsight), which is sublinear in the number of time steps; the degree of sublinearity depends on the number of features representing the follower’s utility function. The regret of the proposed learning algorithm is independent of the size of the state space and polynomial in the rest of the parameters of the game. We show that the proposed learning algorithm outperforms existing modelfree reinforcement learning approaches.more » « lessFree, publiclyaccessible full text available March 1, 2025

In cooperative multiagent reinforcement learning (CoMARL), a team of agents must jointly optimize the team's longterm rewards to learn a designated task. Optimizing rewards as a team often requires interagent communication and data sharing, leading to potential privacy implications. We assume privacy considerations prohibit the agents from sharing their environment interaction data. Accordingly, we propose PrivacyEngineered Value Decomposition Networks (PEVDN), a CoMARL algorithm that models multiagent coordination while provably safeguarding the confidentiality of the agents' environment interaction data. We integrate three privacyengineering techniques to redesign the data flows of the VDN algorithman existing CoMARL algorithm that consolidates the agents' environment interaction data to train a central controller that models multiagent coordinationand develop PEVDN. In the first technique, we design a distributed computation scheme that eliminates Vanilla VDN's dependency on sharing environment interaction data. Then, we utilize a privacypreserving multiparty computation protocol to guarantee that the data flows of the distributed computation scheme do not pose new privacy risks. Finally, we enforce differential privacy to preempt inference threats against the agents' training datapast environment interactionswhen they take actions based on their neural network predictions. We implement PEVDN in StarCraft MultiAgent Competition (SMAC) and show that it achieves 80% of Vanilla VDN's win rate while maintaining differential privacy levels that provide meaningful privacy guarantees. The results demonstrate that PEVDN can safeguard the confidentiality of agents' environment interaction data without sacrificing multiagent coordination.more » « lessFree, publiclyaccessible full text available December 13, 2024

Markov games model interactions among multiple players in a stochastic, dynamic environment. Each player in a Markov game maximizes its expected total discounted reward, which depends upon the policies of the other players. We formulate a class of Markov games, termed affine Markov games, where an affine reward function couples the players’ actions. We introduce a novel solution concept, the softBellman equilibrium, where each player is boundedly rational and chooses a softBellman policy rather than a purely rational policy as in the wellknown Nash equilibrium concept. We provide conditions for the existence and uniqueness of the softBellman equilibrium and propose a nonlinear leastsquares algorithm to compute such an equilibrium in the forward problem. We then solve the inverse game problem of inferring the players’ reward parameters from observed stateaction trajectories via a projectedgradient algorithm. Experiments in a predatorprey OpenAI Gym environment show that the reward parameters inferred by the proposed algorithm outper form those inferred by a baseline algorithm: they reduce the KullbackLeibler divergence between the equilibrium policies and observed policies by at least two orders of magnitude.more » « lessFree, publiclyaccessible full text available December 13, 2024

Stochastic matrices are commonly used to analyze Markov chains, but revealing them can leak sensitive information. Therefore, in this paper we introduce a technique to privatize stochastic matrices in a way that (i) conceals the probabilities they contain, and (ii) still allows for accurate analyses of Markov chains. Specifically, we use differential privacy, which is a statistical framework for protecting sensitive data. To implement it, we introduce the Matrix Dirichlet Mechanism, which is a probabilistic mapping that perturbs a stochastic matrix to provide privacy. We prove that this mechanism provides differential privacy, and we quantify the error induced in private stochastic matrices as a function of the strength of privacy being provided. We then bound the distance between the stationary distribution of the underlying, sensitive stochastic matrix and the stationary distribution of its privatized form. Numerical results show that, under typical conditions, privacy introduces error as low as 5.05% in the stationary distribution of a stochastic matrix.more » « less

Although perception is an increasingly dominant portion of the overall computational cost for autonomous systems, only a fraction of the information perceived is likely to be relevant to the current task. To alleviate these perception costs, we develop a novel simultaneous perception–action design framework wherein an agent senses only the taskrelevant information. This formulation differs from that of a partially observable Markov decision process, since the agent is free to synthesize not only its policy for action selection but also its beliefdependent observation function. The method enables the agent to balance its perception costs with those incurred by operating in its environment. To obtain a computationally tractable solution, we approximate the value function using a novel method of invariant finite belief sets, wherein the agent acts exclusively on a finite subset of the continuous belief space. We solve the approximate problem through value iteration in which a linear program is solved individually for each belief state in the set, in each iteration. Finally, we prove that the value functions, under an assumption on their structure, converge to their continuous statespace values as the sample density increases.more » « less

We study the problem of analyzing the effects of inconsistencies in perception, intent prediction, and decision making among interacting agents. When accounting for these effects, planning is akin to synthesizing policies in uncertain and potentially partiallyobservable environments. We consider the case where each agent, in an effort to avoid a difficult planning problem, does not consider the inconsistencies with other agents when computing its policy. In particular, each agent assumes that other agents compute their policies in the same way as it does, i.e., with the same objective and based on the same system model. While finding policies on the composed system model, which accounts for the agent interactions, scales exponentially, we efficiently provide quantifiable performance metrics in the form of deltas in the probability of satisfying a given specification. We showcase our approach using two realistic autonomous vehicle casestudies and implement it in an autonomous vehicle simulator.more » « less

Offline reinforcement learning (offline RL) considers problems where learning is performed using only previously collected samples and is helpful for the settings in which collecting new data is costly or risky. In modelbased offline RL, the learner performs estimation (or optimization) using a model constructed according to the empirical transition frequencies. We analyze the sample complexity of vanilla modelbased offline RL with dependent samples in the infinitehorizon discountedreward setting. In our setting, the samples obey the dynamics of the Markov decision process and, consequently, may have interdependencies. Under no assumption of independent samples, we provide a highprobability, polynomial sample complexity bound for vanilla modelbased offpolicy evaluation that requires partial or uniform coverage. We extend this result to the offpolicy optimization under uniform coverage. As a comparison to the modelbased approach, we analyze the sample complexity of offpolicy evaluation with vanilla importance sampling in the infinitehorizon setting. Finally, we provide an estimator that outperforms the samplemean estimator for almost deterministic dynamics that are prevalent in reinforcement learning.

Matni, Nikolai ; Morari, Manfred ; Pappas, George J (Ed.)Many dynamical systems—from robots interacting with their surroundings to largescale multiphysics systems—involve a number of interacting subsystems. Toward the objective of learning composite models of such systems from data, we present i) a framework for compositional neural networks, ii) algorithms to train these models, iii) a method to compose the learned models, iv) theoretical results that bound the error of the resulting composite models, and v) a method to learn the composition itself, when it is not known a priori. The end result is a modular approach to learning: neural network submodels are trained on trajectory data generated by relatively simple subsystems, and the dynamics of more complex composite systems are then predicted without requiring additional data generated by the composite systems themselves. We achieve this compositionality by representing the system of interest, as well as each of its subsystems, as a portHamiltonian neural network (PHNN)—a class of neural ordinary differential equations that uses the portHamiltonian systems formulation as inductive bias. We compose collections of PHNNs by using the system’s physicsinformed interconnection structure, which may be known a priori, or may itself be learned from data. We demonstrate the novel capabilities of the proposed framework through numerical examples involving interacting springmassdamper systems. Models of these systems, which include nonlinear energy dissipation and control inputs, are learned independently. Accurate compositions are learned using an amount of training data that is negligible in comparison with that required to train a new model from scratch. Finally, we observe that the composite PHNNs enjoy properties of portHamiltonian systems, such as cyclopassivity—a property that is useful for control purposes.more » « less