We study the problem of synthesizing lockdown policies (schedules of maximum capacities for different types of activity sites) to minimize the number of deceased individuals due to a pandemic within a given metropolitan statistical area (MSA) while controlling the severity of the imposed lockdown. To synthesize and evaluate lockdown policies, we develop a multiscale susceptible, infected, recovered, and deceased (SIRD) model that partitions a given MSA into geographic subregions and incorporates data on the behaviors of the populations of these subregions. This modeling approach allows for the analysis of heterogeneous lockdown policies that vary across the different types of activity sites within each subregion of the MSA. We formulate the synthesis of optimal lockdown policies as a nonconvex optimization problem, and we develop an iterative algorithm that addresses this nonconvexity through sequential convex programming. We empirically demonstrate the effectiveness of the developed approach by applying it to six of the largest MSAs in the United States. Compared with three less heterogeneous baseline lockdown policies, the developed heterogeneous lockdown policies reduce the number of deceased individuals by up to 45 percent over a 100-day period while also imposing less severe lockdowns.

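To make the modeling idea concrete, the following is a minimal sketch of a discrete-time SIRD simulation for a single, well-mixed population. It is not the multiscale model of the abstract: the function name `simulate_sird`, the rates `beta`, `gamma`, and `mu`, and the way `capacity` scales the contact rate are all assumptions made for illustration.

```python
def simulate_sird(beta, gamma, mu, capacity, days, s0=0.99, i0=0.01):
    """Simulate population fractions (S, I, R, D) in daily steps.

    capacity in [0, 1] is a hypothetical lockdown knob: it scales the
    infection rate beta, so 1.0 means no restriction.
    """
    s, i, r, d = s0, i0, 0.0, 0.0
    for _ in range(days):
        new_infected = capacity * beta * s * i   # S -> I
        new_recovered = gamma * i                # I -> R
        new_deceased = mu * i                    # I -> D
        s -= new_infected
        i += new_infected - new_recovered - new_deceased
        r += new_recovered
        d += new_deceased
    return s, i, r, d


# Compare an unrestricted run with a run at half capacity (made-up rates).
open_run = simulate_sird(beta=0.3, gamma=0.1, mu=0.01, capacity=1.0, days=100)
half_run = simulate_sird(beta=0.3, gamma=0.1, mu=0.01, capacity=0.5, days=100)
print("deceased fraction, open:", open_run[3])
print("deceased fraction, half capacity:", half_run[3])
```

In this toy setting a lower capacity yields fewer deaths over the horizon; a policy synthesis method would trade that reduction against a measure of lockdown severity.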
We study the problem of reinforcement learning for a task encoded by a reward machine. The task is defined over a set of properties in the environment, called atomic propositions, and represented by Boolean variables. One unrealistic assumption commonly used in the literature is that the truth values of these propositions are accurately known. In real situations, however, these truth values are uncertain since they come from sensors that suffer from imperfections. At the same time, reward machines can be difficult to model explicitly, especially when they encode complicated tasks. We develop a reinforcement learning algorithm that infers a reward machine that encodes the underlying task while learning how to execute it, despite the uncertainties of the propositions’ truth values. To address such uncertainties, the algorithm maintains a probabilistic estimate of the truth values of the atomic propositions; it updates this estimate according to new sensory measurements that arrive from exploration of the environment. Additionally, the algorithm maintains a hypothesis reward machine, which acts as an estimate of the reward machine that encodes the task to be learned. As the agent explores the environment, the algorithm updates the hypothesis reward machine according to the obtained rewards and the estimate of the atomic propositions’ truth values. Finally, the algorithm uses a Q-learning procedure over the states of the hypothesis reward machine to determine an optimal policy that accomplishes the task. We prove that the algorithm successfully infers the reward machine and asymptotically learns a policy that accomplishes the respective task.
Free, publicly accessible full text available August 1, 2025.
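The Q-learning step over reward-machine states can be sketched as follows. This is a simplified illustration only: it assumes the reward machine is already known and that proposition labels are observed exactly, whereas the abstract's algorithm infers the machine and handles uncertain labels. The corridor environment, the machine `RM`, and all names are hypothetical.

```python
import random

# Hypothetical reward machine for "reach a, then reach b":
# (machine state, proposition) -> (next machine state, reward)
RM = {("u0", "a"): ("u1", 0.0), ("u1", "b"): ("u2", 1.0)}
LABELS = {0: "a", 3: "b"}      # corridor cell -> proposition that holds there
ACTIONS = (-1, +1)             # move left / move right


def step(pos, u, action):
    """Move in a 4-cell corridor and advance the reward machine."""
    pos = min(3, max(0, pos + action))
    u_next, reward = RM.get((u, LABELS.get(pos)), (u, 0.0))
    return pos, u_next, reward


def train(episodes=3000, alpha=0.5, gamma=0.9, eps=0.3, seed=0):
    """Tabular Q-learning on the product of cell and machine state."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        pos, u = 1, "u0"
        for _ in range(50):
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: q.get((pos, u, b), 0.0))
            next_pos, next_u, reward = step(pos, u, a)
            done = next_u == "u2"
            future = 0.0 if done else max(
                q.get((next_pos, next_u, b), 0.0) for b in ACTIONS)
            old = q.get((pos, u, a), 0.0)
            q[(pos, u, a)] = old + alpha * (reward + gamma * future - old)
            pos, u = next_pos, next_u
            if done:
                break
    return q


def greedy_rollout(q, max_steps=20):
    """Follow the greedy policy from the start and record the actions."""
    pos, u, path = 1, "u0", []
    for _ in range(max_steps):
        a = max(ACTIONS, key=lambda b: q.get((pos, u, b), 0.0))
        pos, u, _reward = step(pos, u, a)
        path.append(a)
        if u == "u2":
            break
    return path, u
```

Indexing the Q-table by the machine state is what lets the same corridor cell map to different actions before and after proposition `a` has been observed.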

In a Stackelberg game, a leader commits to a randomized strategy and a follower chooses their best strategy in response. We consider an extension of a standard Stackelberg game, called a discrete-time dynamic Stackelberg game, that has an underlying state space that affects the leader’s rewards and available strategies and evolves in a Markovian manner depending on both the leader’s and the follower’s selected strategies. Although standard Stackelberg games have been utilized to improve scheduling in security domains, their deployment is often limited by requiring complete information about the follower’s utility function. In contrast, we consider scenarios where the follower’s utility function is unknown to the leader; however, it can be linearly parameterized. Our objective is then to provide an algorithm that prescribes a randomized strategy to the leader at each step of the game based on observations of how the follower responded in previous steps. We design an online learning algorithm that, with high probability, is no-regret, i.e., achieves a regret bound (when compared to the best policy in hindsight) that is sublinear in the number of time steps; the degree of sublinearity depends on the number of features representing the follower’s utility function. The regret of the proposed learning algorithm is independent of the size of the state space and polynomial in the rest of the parameters of the game. We show that the proposed learning algorithm outperforms existing model-free reinforcement learning approaches.
Free, publicly accessible full text available March 1, 2025.
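The commitment structure described in the first sentence can be illustrated with a static two-by-two game. The sketch below assumes the follower's payoffs are known, which is exactly the assumption the abstract relaxes, and it uses a grid search rather than any learning algorithm. The payoff matrices and function names are hypothetical.

```python
def follower_best_responses(p, follower_payoff, tol=1e-12):
    """Columns that maximize the follower's expected payoff when the
    leader plays row 0 with probability p and row 1 otherwise."""
    x = (p, 1.0 - p)
    utils = [sum(x[i] * follower_payoff[i][j] for i in range(2))
             for j in range(2)]
    best = max(utils)
    return [j for j, u in enumerate(utils) if best - u <= tol]


def leader_value(p, leader_payoff, follower_payoff):
    """Leader's expected payoff, with follower ties broken in the
    leader's favor (the usual strong Stackelberg convention)."""
    x = (p, 1.0 - p)
    return max(sum(x[i] * leader_payoff[i][j] for i in range(2))
               for j in follower_best_responses(p, follower_payoff))


def best_commitment(leader_payoff, follower_payoff, grid=100):
    """Grid search over the leader's mixing probability."""
    candidates = [k / grid for k in range(grid + 1)]
    return max(candidates,
               key=lambda p: leader_value(p, leader_payoff, follower_payoff))


# Hypothetical payoffs: rows are leader actions, columns follower actions.
LEADER = [[2.0, 4.0], [1.0, 3.0]]
FOLLOWER = [[1.0, 0.0], [0.0, 1.0]]
p_star = best_commitment(LEADER, FOLLOWER)
print(p_star, leader_value(p_star, LEADER, FOLLOWER))
```

In this example, committing to a mixed strategy earns the leader more than either pure commitment, which is why the leader's strategy is randomized.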

Decision-making in multiplayer games can be extremely challenging, particularly under uncertainty. In this work, we propose a new sample-based approximation to a class of stochastic, general-sum, pure Nash games, where each player has an expected-value objective and a set of chance constraints. This new approximation scheme inherits the accuracy of objective approximation from the established sample average approximation (SAA) method and enjoys a feasibility guarantee derived from the scenario optimization literature. We characterize the sample complexity of this new game-theoretic approximation scheme and observe that high accuracy usually requires a large number of samples, which results in a large number of sampled constraints. To accommodate this, we decompose the approximated game into a set of smaller games with few constraints for each sampled scenario, and propose a decentralized, consensus-based ADMM algorithm to efficiently compute a generalized Nash equilibrium (GNE) of the approximated game. We prove the convergence of our algorithm to a GNE and empirically demonstrate superior performance relative to a recent baseline algorithm based on ADMM and an interior-point method.
Free, publicly accessible full text available December 13, 2024.
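The two ingredients of the approximation, a sample average for the objective and one sampled constraint per scenario, can be shown on a single-player, scalar toy problem. This is an illustrative simplification, not the game-theoretic scheme of the abstract; the quadratic objective, the constraint form, and the function name are assumptions.

```python
def saa_with_scenarios(objective_samples, constraint_samples):
    """Minimize the sample average of (x - xi)^2 over scalar x,
    subject to x >= eta for every sampled scenario eta.

    Without constraints the minimizer is the sample mean; the scenario
    constraints reduce to a single lower bound, so the solution is the
    sample mean clipped to that bound.
    """
    sample_mean = sum(objective_samples) / len(objective_samples)
    lower_bound = max(constraint_samples)
    return max(sample_mean, lower_bound)


# Scenario constraint is active: the bound 2.5 exceeds the sample mean 2.0.
print(saa_with_scenarios([1.0, 2.0, 3.0], [0.5, 2.5]))
# Scenario constraint is inactive: the sample mean 2.0 is already feasible.
print(saa_with_scenarios([1.0, 2.0, 3.0], [0.5, 1.5]))
```

Each additional scenario adds one constraint, which shows on a small scale why a large sample count leads to a large number of sampled constraints.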

Although perception is an increasingly dominant portion of the overall computational cost for autonomous systems, only a fraction of the information perceived is likely to be relevant to the current task. To alleviate these perception costs, we develop a novel simultaneous perception–action design framework wherein an agent senses only the task-relevant information. This formulation differs from that of a partially observable Markov decision process, since the agent is free to synthesize not only its policy for action selection but also its belief-dependent observation function. The method enables the agent to balance its perception costs with those incurred by operating in its environment. To obtain a computationally tractable solution, we approximate the value function using a novel method of invariant finite belief sets, wherein the agent acts exclusively on a finite subset of the continuous belief space. We solve the approximate problem through value iteration in which a linear program is solved individually for each belief state in the set, in each iteration. Finally, we prove that the value functions, under an assumption on their structure, converge to their continuous state-space values as the sample density increases.

We study the problem of analyzing the effects of inconsistencies in perception, intent prediction, and decision making among interacting agents. When accounting for these effects, planning is akin to synthesizing policies in uncertain and potentially partially observable environments. We consider the case where each agent, in an effort to avoid a difficult planning problem, does not consider the inconsistencies with other agents when computing its policy. In particular, each agent assumes that other agents compute their policies in the same way as it does, i.e., with the same objective and based on the same system model. While finding policies on the composed system model, which accounts for the agent interactions, scales exponentially, we efficiently provide quantifiable performance metrics in the form of deltas in the probability of satisfying a given specification. We showcase our approach using two realistic autonomous vehicle case studies and implement it in an autonomous vehicle simulator.

Offline reinforcement learning (offline RL) considers problems where learning is performed using only previously collected samples and is useful in settings where collecting new data is costly or risky. In model-based offline RL, the learner performs estimation (or optimization) using a model constructed according to the empirical transition frequencies. We analyze the sample complexity of vanilla model-based offline RL with dependent samples in the infinite-horizon discounted-reward setting. In our setting, the samples obey the dynamics of the Markov decision process and, consequently, may have interdependencies. Without assuming independent samples, we provide a high-probability, polynomial sample complexity bound for vanilla model-based off-policy evaluation that requires partial or uniform coverage. We extend this result to off-policy optimization under uniform coverage. As a comparison to the model-based approach, we analyze the sample complexity of off-policy evaluation with vanilla importance sampling in the infinite-horizon setting. Finally, we provide an estimator that outperforms the sample-mean estimator for almost-deterministic dynamics, which are prevalent in reinforcement learning.
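The vanilla model-based evaluation described above can be sketched in a few lines: estimate transition probabilities and mean rewards from logged transitions, then evaluate a fixed policy on the estimated model. This is a minimal tabular sketch under assumed names (`fit_model`, `evaluate`) and an assumed data format of `(state, action, reward, next_state)` tuples taken from one trajectory; it does not reproduce the abstract's analysis or estimator.

```python
from collections import defaultdict


def fit_model(transitions):
    """Empirical model: (s, a) -> (next-state frequencies, mean reward)."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
    model = {}
    for sa, next_counts in counts.items():
        n = sum(next_counts.values())
        probs = {s2: c / n for s2, c in next_counts.items()}
        model[sa] = (probs, reward_sum[sa] / n)
    return model


def evaluate(model, policy, states, gamma=0.9, iters=500):
    """Iterate the Bellman equation for a fixed policy on the model.

    Assumes coverage: every (s, policy[s]) pair appears in the data,
    and every observed next state is listed in `states`.
    """
    v = {s: 0.0 for s in states}
    for _ in range(iters):
        new_v = {}
        for s in states:
            probs, r = model[(s, policy[s])]
            new_v[s] = r + gamma * sum(p * v[s2] for s2, p in probs.items())
        v = new_v
    return v


# One logged trajectory; consecutive samples are dependent by construction.
trajectory = [(0, 0, 0.0, 1), (1, 0, 1.0, 0), (0, 0, 0.0, 1), (1, 0, 1.0, 0)]
values = evaluate(fit_model(trajectory), policy={0: 0, 1: 0},
                  states=[0, 1], gamma=0.5)
print(values)
```

The lookup `model[(s, policy[s])]` fails for state-action pairs absent from the data, which is the practical face of the coverage requirement.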

An atomic routing game is a multiplayer game on a directed graph. Each player in the game chooses the lowest-cost path (a sequence of links that connect its origin node to its destination node), where the cost of each link is a function of all players’ choices. We develop a novel numerical method to design the link cost function in atomic routing games such that the players’ choices at the Nash equilibrium minimize a given smooth performance function. This method first approximates the nonsmooth Nash equilibrium conditions with smooth ones, then iteratively improves the link cost function via implicit differentiation. We demonstrate the application of this method to atomic routing games that model noncooperative agents navigating in grid worlds.
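One common way to smooth a lowest-cost choice is to replace the minimum over path costs with a log-sum-exp "softmin". The abstract does not say which smoothing the authors use, so the sketch below is an assumption chosen for illustration; `beta` is a hypothetical sharpness parameter.

```python
import math


def softmin(costs, beta):
    """Smooth approximation of min(costs); tighter as beta grows.

    Shifting by the true minimum keeps the exponentials from underflowing
    to a zero sum.
    """
    m = min(costs)
    return m - math.log(sum(math.exp(-beta * (c - m)) for c in costs)) / beta


def softmin_weights(costs, beta):
    """Smooth path-choice probabilities (gradient of softmin in the costs)."""
    m = min(costs)
    w = [math.exp(-beta * (c - m)) for c in costs]
    total = sum(w)
    return [x / total for x in w]


path_costs = [3.0, 1.0, 2.0]
print(softmin(path_costs, beta=1.0))          # below the true minimum of 1.0
print(softmin(path_costs, beta=50.0))         # very close to 1.0
print(softmin_weights(path_costs, beta=1.0))  # most weight on the cheapest path
```

Because the weights vary smoothly with the costs, derivatives of the players' choices with respect to the link costs exist, which is what a method based on implicit differentiation needs.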