skip to main content


Title: Safe Online Convex Optimization with Unknown Linear Safety Constraints
We study the problem of safe online convex optimization, where the action at each time step must satisfy a set of linear safety constraints. The goal is to select a sequence of ac- tions to minimize the regret without violating the safety constraints at any time step (with high probability). The parameters that specify the linear safety constraints are unknown to the algorithm. The algorithm has access to only the noisy observations of constraints for the chosen actions. We pro- pose an algorithm, called the Safe Online Projected Gradient Descent(SO-PGD) algorithm to address this problem. We show that, under the assumption of the availability of a safe baseline action, the SO-PGD algorithm achieves a regret O(T^2/3). While there are many algorithms for online convex optimization (OCO) problems with safety constraints avail- able in the literature, they allow constraint violations during learning/optimization, and the focus has been on characterizing the cumulative constraint violations. To the best of our knowledge, ours is the first work that provides an algorithm with provable guarantees on the regret, without violating the linear safety constraints (with high probability) at any time step.  more » « less
Award ID(s):
2045783 1850206
NSF-PAR ID:
10327541
Author(s) / Creator(s):
;
Date Published:
Journal Name:
AAAI Conference on Artificial Intelligence (AAAI)
ISSN:
2159-5399
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Safe reinforcement learning is extremely challenging--not only must the agent explore an unknown environment, it must do so while ensuring no safety constraint violations. We formulate this safe reinforcement learning (RL) problem using the framework of a finite-horizon Constrained Markov Decision Process (CMDP) with an unknown transition probability function, where we model the safety requirements as constraints on the expected cumulative costs that must be satisfied during all episodes of learning. We propose a model-based safe RL algorithm that we call Doubly Optimistic and Pessimistic Exploration (DOPE), and show that it achieves an objective regret $\tilde{O}(|\mathcal{S}|\sqrt{|\mathcal{A}| K})$ without violating the safety constraints during learning, where $|\mathcal{S}|$ is the number of states, $|\mathcal{A}|$ is the number of actions, and $K$ is the number of learning episodes. Our key idea is to combine a reward bonus for exploration (optimism) with a conservative constraint (pessimism), in addition to the standard optimistic model-based exploration. DOPE is not only able to improve the objective regret bound, but also shows a significant empirical performance improvement as compared to earlier optimism-pessimism approaches. 
    more » « less
  2. In this work, we consider a distributed online convex optimization problem, with time-varying (potentially adversarial) constraints. A set of nodes, jointly aim to minimize a global objective function, which is the sum of local convex functions. The objective and constraint functions are revealed locally to the nodes, at each time, after taking an action. Naturally, the constraints cannot be instantaneously satisfied. Therefore, we reformulate the problem to satisfy these constraints in the long term. To this end, we propose a distributed primal-dual mirror descent-based algorithm, in which the primal and dual updates are carried out locally at all the nodes. This is followed by sharing and mixing of the primal variables by the local nodes via communication with the immediate neighbors. To quantify the performance of the proposed algorithm, we utilize the challenging, but more realistic metrics of dynamic regret and fit. Dynamic regret measures the cumulative loss incurred by the algorithm compared to the best dynamic strategy, while fit measures the long term cumulative constraint violations. Without assuming the restrictive Slater’s conditions, we show that the proposed algorithm achieves sublinear regret and fit under mild, commonly used assumptions. 
    more » « less
  3. null ; null ; null ; null ; null (Ed.)
    We consider an online optimization problem in which the reward functions are DR-submodular, and in addition to maximizing the total reward, the sequence of decisions must satisfy some convex constraints on average. Specifically, at each round t, upon committing to an action x_t, a DR-submodular utility function f_t and a convex constraint function g_t are revealed, and the goal is to maximize the overall utility while ensuring the average of the constraint functions is non-positive (so constraints are satisfied on average). Such cumulative constraints arise naturally in applications where the average resource consumption is required to remain below a specified threshold. We study this problem under an adversarial model and a stochastic model for the convex constraints, where the functions g_t can vary arbitrarily or according to an i.i.d. process over time. We propose a single algorithm which achieves sub-linear regret (with respect to the time horizon T) as well as sub-linear constraint violation bounds in both settings, without prior knowledge of the regime. Prior works have studied this problem in the special case of linear constraint functions. Our results not only improve upon the existing bounds under linear cumulative constraints, but also give the first sub-linear bounds for general convex long-term constraints. 
    more » « less
  4. This paper considers online convex optimization (OCO) with stochastic constraints, which generalizes Zinkevich’s OCO over a known simple fixed set by introducing multiple stochastic functional constraints that are i.i.d. generated at each round and are disclosed to the decision maker only after the decision is made. This formulation arises naturally when decisions are restricted by stochastic environ- ments or deterministic environments with noisy observations. It also includes many important problems as special case, such as OCO with long term constraints, stochastic constrained convex optimization, and deterministic constrained con- vex optimization. To solve this problem, this paper proposes a new algorithm that achieves O(√T ) expected regret and constraint violations and O(√T log(T )) high probability regret and constraint violations. Experiments on a real-world data center scheduling problem further verify the performance of the new algorithm. 
    more » « less
  5. This paper considers online convex optimization (OCO) with stochastic constraints, which generalizes Zinkevich’s OCO over a known simple fixed set by introducing multiple stochastic functional constraints that are i.i.d. generated at each round and are disclosed to the decision maker only after the decision is made. This formulation arises naturally when decisions are restricted by stochastic environ- ments or deterministic environments with noisy observations. It also includes many important problems as special case, such as OCO with long term constraints, stochastic constrained convex optimization, and deterministic constrained con- vex optimization. To solve this problem, this paper proposes a new algorithm that achieves O(√T ) expected regret and constraint violations and O(√T log(T )) high probability regret and constraint violations. Experiments on a real-world data center scheduling problem further verify the performance of the new algorithm. 
    more » « less