Fast Learning for Renewal Optimization in Online Task Scheduling
This paper considers online optimization of a renewal-reward system. A controller performs a sequence of tasks back-to-back. Each task has a random vector of parameters, called the \emph{task type vector}, that affects the task processing options and also affects the resulting reward and time duration of the task. The probability distribution for the task type vector is unknown and the controller must learn to make efficient decisions so that time-average reward converges to optimality. Prior work on such renewal optimization problems leaves open the question of optimal convergence time. This paper develops an algorithm with an optimality gap that decays like $O(1/\sqrt{k})$, where $k$ is the number of tasks processed. The same algorithm is shown to have faster $O(\log(k)/k)$ performance when the system satisfies a strong concavity property. The proposed algorithm uses an auxiliary variable that is updated according to a classic Robbins-Monro iteration. It makes online scheduling decisions at the start of each renewal frame based on this variable and on the observed task type. A matching converse is obtained for the strongly concave case by constructing an example system for which all algorithms have performance at best $\Omega(\log(k)/k)$. A matching $\Omega(1/\sqrt{k})$ converse is also shown for the general case without strong concavity.
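The abstract describes the algorithm only at a high level. The following Python sketch illustrates one plausible reading of that structure, in which an auxiliary variable $\theta$ (an estimate of the optimal reward-per-unit-time ratio) is updated by a Robbins-Monro iteration and each frame's decision maximizes reward minus $\theta$ times duration over the options allowed by the observed task type. The function names, the option model (`options_for`), and the $1/k$ stepsize are illustrative assumptions, not the paper's exact specification.

```python
import random


def run_renewal_controller(num_tasks, sample_task_type, options_for):
    """Hedged sketch of a Robbins-Monro-driven renewal controller.

    sample_task_type() returns the observed task type for the next task;
    options_for(task_type) returns the feasible (reward, duration) pairs
    for that task type. Both are placeholders for the system model.
    """
    theta = 0.0  # auxiliary variable: estimate of the optimal reward/time ratio
    total_reward, total_time = 0.0, 0.0
    for k in range(1, num_tasks + 1):
        task_type = sample_task_type()
        # Frame decision: pick the option maximizing reward - theta * duration.
        reward, duration = max(options_for(task_type),
                               key=lambda rd: rd[0] - theta * rd[1])
        total_reward += reward
        total_time += duration
        # Robbins-Monro update with an assumed O(1/k) diminishing stepsize.
        theta += (reward - theta * duration) / k
    return total_reward / total_time  # empirical time-average reward


if __name__ == "__main__":
    # Toy model: each task type t in [0, 1] offers a fast low-reward mode
    # and a slow high-reward mode (purely illustrative numbers).
    avg = run_renewal_controller(
        num_tasks=100_000,
        sample_task_type=random.random,
        options_for=lambda t: [(t, 1.0), (2.0 * t, 3.0)],
    )
    print(f"empirical time-average reward: {avg:.4f}")
```

The form reward $- \theta \cdot$ duration mirrors the standard way a renewal-reward ratio objective is reduced to a per-frame maximization; the paper's actual algorithm may differ in the stepsize choice and in how $\theta$ is initialized or constrained.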
- Award ID(s): 1718477
- PAR ID: 10252441
- Date Published:
- Journal Name: ArXiv.org
- ISSN: 2331-8422
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- arXiv:2401.07170v1: This paper considers online optimization for a system that performs a sequence of back-to-back tasks. Each task can be processed in one of multiple processing modes that affect the duration of the task, the reward earned, and an additional vector of penalties (such as energy or cost). Let A[k] be a random matrix of parameters that specifies the duration, reward, and penalty vector under each processing option for task k. The goal is to observe A[k] at the start of each new task k and then choose a processing mode for the task so that, over time, time average reward is maximized subject to time average penalty constraints. This is a renewal optimization problem and is challenging because the probability distribution for the A[k] sequence is unknown. Prior work shows that any algorithm that comes within ϵ of optimality must have Ω(1/ϵ^2) convergence time. The only known algorithm that can meet this bound operates without time average penalty constraints and uses a diminishing stepsize that cannot adapt when probabilities change. This paper develops a new algorithm that is adaptive and comes within O(ϵ) of optimality for any interval of Ω(1/ϵ^2) tasks over which probabilities are held fixed, regardless of probabilities before the start of the interval.
- In this paper, we present an accelerated quasi-Newton proximal extragradient method for solving unconstrained smooth convex optimization problems. With access only to the gradients of the objective function, we prove that our method can achieve a convergence rate of $\mathcal{O}\bigl(\min\{\frac{1}{k^2}, \frac{\sqrt{d\log k}}{k^{2.5}}\}\bigr)$, where $d$ is the problem dimension and $k$ is the number of iterations. In particular, in the regime where $k = \mathcal{O}(d)$, our method matches the \emph{optimal rate} of $\mathcal{O}(\frac{1}{k^2})$ by Nesterov's accelerated gradient (NAG). Moreover, in the regime where $k = \Omega(d \log d)$, it outperforms NAG and converges at a \emph{faster rate} of $\mathcal{O}\bigl(\frac{\sqrt{d\log k}}{k^{2.5}}\bigr)$. To the best of our knowledge, this result is the first to demonstrate a provable gain for a quasi-Newton-type method over NAG in the convex setting. To achieve such results, we build our method on a recent variant of the Monteiro-Svaiter acceleration framework and adopt an online learning perspective to update the Hessian approximation matrices, in which we relate the convergence rate of our method to the dynamic regret of a specific online convex optimization problem in the space of matrices.
- This paper proves an impossibility result for stochastic network utility maximization for multi-user wireless systems, including multi-access and broadcast systems. Every time slot an access point observes the current channel states and opportunistically selects a vector of transmission rates. Channel state vectors are assumed to be independent and identically distributed with an unknown probability distribution. The goal is to learn to make decisions over time that maximize a concave utility function of the running time average transmission rate of each user. Recently it was shown that a stochastic Frank-Wolfe algorithm converges to utility-optimality with an error of O(log(T)/T), where T is the time the algorithm has been running. An existing Ω(1/T) converse is known. The current paper improves the converse to Ω(log(T)/T), which matches the known achievability result. The proof uses a reduction from the opportunistic scheduling problem to a Bernoulli estimation problem. Along the way, it refines a result on Bernoulli estimation.
- We study the problem of reinforcement learning (RL) with low (policy) switching cost, a problem well-motivated by real-life RL applications in which deployments of new policies are costly and the number of policy updates must be low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of $O(HSA \log\log T)$. This is an exponential improvement over the best-known switching cost $O(H^2SA\log T)$ among existing methods with $\widetilde{O}(\mathrm{poly}(H,S,A)\sqrt{T})$ regret. In the above, $S$ and $A$ denote the numbers of states and actions in an $H$-horizon episodic Markov Decision Process model with unknown transitions, and $T$ is the number of steps. As a byproduct of our new techniques, we also derive a reward-free exploration algorithm with a switching cost of $O(HSA)$. Furthermore, we prove a pair of information-theoretic lower bounds showing that (1) any no-regret algorithm must have a switching cost of $\Omega(HSA)$; and (2) any $\widetilde{O}(\sqrt{T})$ regret algorithm must incur a switching cost of $\Omega(HSA\log\log T)$. Both our algorithms are thus optimal in their switching costs.