skip to main content


Title: Fast Learning for Renewal Optimization in Online Task Scheduling
This paper considers online optimization of a renewal-reward system. A controller performs a sequence of tasks back-to-back. Each task has a random vector of parameters, called the \emph{task type vector}, that affects the task processing options and also affects the resulting reward and time duration of the task. The probability distribution for the task type vector is unknown and the controller must learn to make efficient decisions so that time average reward converges to optimality. Prior work on such renewal optimization problems leaves open the question of optimal convergence time. This paper develops an algorithm with an optimality gap that decays like $O(1/\sqrt{k})$, where $k$ is the number of tasks processed. The same algorithm is shown to have faster $O(\log(k)/k)$ performance when the system satisfies a strong concavity property. The proposed algorithm uses an auxiliary variable that is updated according to a classic Robbins-Monro iteration. It makes online scheduling decisions at the start of each renewal frame based on this variable and on the observed task type. A matching converse is obtained for the strongly concave case by constructing an example system for which all algorithms have performance at best $\Omega(\log(k)/k)$. A matching $\Omega(1/\sqrt{k})$ converse is also shown for the general case without strong concavity.  more » « less
Award ID(s):
1718477
NSF-PAR ID:
10252441
Author(s) / Creator(s):
Date Published:
Journal Name:
ArXivorg
ISSN:
2331-8422
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper considers online optimization of a renewal-reward system. A controller performs a sequence of tasks back-to-back. Each task has a random vector of parameters, called the task type vector, that affects the task processing options and also affects the resulting reward and time duration of the task. The probability distribution for the task type vector is unknown and the controller must learn to make efficient decisions so that time-average reward converges to optimality. Prior work on such renewal optimization problems leaves open the question of optimal convergence time. This paper develops an algorithm with an optimality gap that decays like O(1/√k), where k is the number of tasks processed. The same algorithm is shown to have faster O(log(k)/k) performance when the system satisfies a strong concavity property. The proposed algorithm uses an auxiliary variable that is updated according to a classic Robbins-Monro iteration. It makes online scheduling decisions at the start of each renewal frame based on this variable and the observed task type. A matching converse is obtained for the strongly concave case by constructing an example system for which all algorithms have performance at best Ω(log(k)/k). A matching Ω(1/√k) converse is also shown for the general case without strong concavity. 
    more » « less
  2. arXiv:2401.07170v1 (Ed.)
    This paper considers online optimization for a system that performs a sequence of back-to-back tasks. Each task can be processed in one of multiple processing modes that affect the duration of the task, the reward earned, and an additional vector of penalties (such as energy or cost). Let A[k] be a random matrix of parameters that specifies the duration, reward, and penalty vector under each processing option for task k. The goal is to observe A[k] at the start of each new task k and then choose a processing mode for the task so that, over time, time average reward is maximized subject to time average penalty constraints. This is a renewal optimization problem and is challenging because the probability distribution for the A[k] sequence is unknown. Prior work shows that any algorithm that comes within ϵ of optimality must have (1/ϵ^2) convergence time. The only known algorithm that can meet this bound operates without time average penalty constraints and uses a diminishing stepsize that cannot adapt when probabilities change. This paper develops a new algorithm that is adaptive and comes within O(ϵ) of optimality for any interval of  (1/ϵ^2) tasks over which probabilities are held fixed, regardless of probabilities before the start of the interval. 
    more » « less
  3. Woodruff, David P. (Ed.)
    We give improved algorithms for maintaining edge-orientations of a fully-dynamic graph, such that the maximum out-degree is bounded. On one hand, we show how to orient the edges such that maximum out-degree is proportional to the arboricity $\alpha$ of the graph, in, either, an amortised update time of $O(\log^2 n \log \alpha)$, or a worst-case update time of $O(\log^3 n \log \alpha)$. On the other hand, motivated by applications including dynamic maximal matching, we obtain a different trade-off. Namely, the improved update time of either $O(\log n \log \alpha)$, amortised, or $O(\log ^2 n \log \alpha)$, worst-case, for the problem of maintaining an edge-orientation with at most $O(\alpha + \log n)$ out-edges per vertex. Finally, all of our algorithms naturally limit the recourse to be polylogarithmic in $n$ and $\alpha$. Our algorithms adapt to the current arboricity of the graph, and yield improvements over previous work: Firstly, we obtain deterministic algorithms for maintaining a $(1+\varepsilon)$ approximation of the maximum subgraph density, $\rho$, of the dynamic graph. Our algorithms have update times of $O(\varepsilon^{-6}\log^3 n \log \rho)$ worst-case, and $O(\varepsilon^{-4}\log^2 n \log \rho)$ amortised, respectively. We may output a subgraph $H$ of the input graph where its density is a $(1+\varepsilon)$ approximation of the maximum subgraph density in time linear in the size of the subgraph. These algorithms have improved update time compared to the $O(\varepsilon^{-6}\log ^4 n)$ algorithm by Sawlani and Wang from STOC 2020. Secondly, we obtain an $O(\varepsilon^{-6}\log^3 n \log \alpha)$ worst-case update time algorithm for maintaining a $(1~+~\varepsilon)\textnormal{OPT} + 2$ approximation of the optimal out-orientation of a graph with adaptive arboricity $\alpha$, improving the $O(\varepsilon^{-6}\alpha^2 \log^3 n)$ algorithm by Christiansen and Rotenberg from ICALP 2022. This yields the first worst-case polylogarithmic dynamic algorithm for decomposing into $O(\alpha)$ forests. Thirdly, we obtain arboricity-adaptive fully-dynamic deterministic algorithms for a variety of problems including maximal matching, $\Delta+1$ colouring, and matrix vector multiplication. All update times are worst-case $O(\alpha+\log^2n \log \alpha)$, where $\alpha$ is the current arboricity of the graph. For the maximal matching problem, the state-of-the-art deterministic algorithms by Kopelowitz, Krauthgamer, Porat, and Solomon from ICALP 2014 runs in time $O(\alpha^2 + \log^2 n)$, and by Neiman and Solomon from STOC 2013 runs in time $O(\sqrt{m})$. We give improved running times whenever the arboricity $\alpha \in \omega( \log n\sqrt{\log\log n})$. 
    more » « less
  4. This paper proves an impossibility result for stochastic network utility maximization for multi-user wireless systems, including multi-access and broadcast systems. Every time slot an access point observes the current channel states and opportunistically selects a vector of transmission rates. Channel state vectors are assumed to be independent and identically distributed with an unknown probability distribution. The goal is to learn to make decisions over time that maximize a concave utility function of the running time average transmission rate of each user. Recently it was shown that a stochastic Frank-Wolfe algorithm converges to utility-optimality with an error of O(log(T)/T ), where T is the time the algorithm has been running. An existing O(1/T ) converse is known. The current paper improves the converse to O(log(T)/T), which matches the known achievability result. The proof uses a reduction from the opportunistic scheduling problem to a Bernoulli estimation problem. Along the way, it refines a result on Bernoulli estimation. 
    more » « less
  5. We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision process (MDP), which is more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of model-free RL algorithms is relatively inadequate for the average-reward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCB-AVG, based on an optimistic variant of variance-reduced Q-learning. We show that UCB-AVG achieves a regret bound $\widetilde{O}(S^5A^2sp(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of state-action space, and $sp(h^*)$ the span of the optimal bias function. Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence in $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results either are suboptimal in $T$ or require strong assumptions of ergodicity or uniformly mixing of MDPs. In the simulator setting, we adapt the idea of UCB-AVG to develop a model-free algorithm that finds an $\epsilon$-optimal policy with sample complexity $\widetilde{O}(SAsp^2(h^*)\epsilon^{-2} + S^2Asp(h^*)\epsilon^{-1}).$ This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SAsp(^*)\epsilon^{-2})$. Existing work mainly focuses on ergodic MDPs and the results typically depend on $t_{mix},$ the worst-case mixing time induced by a policy. We remark that the diameter $D$ and mixing time $t_{mix}$ are both lower bounded by $sp(h^*)$, and $t_{mix}$ can be arbitrarily large for certain MDPs. On the technical side, our approach integrates two key ideas: learning an $\gamma$-discounted MDP as an approximation, and leveraging reference-advantage decomposition for variance in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of value-difference between state pairs to provide a sharper bound on the variance of reference advantage. We also crucially use a careful choice of the discounted factor $\gamma$ to balance approximation error due to discounting and the statistical learning error, and we are able to maintain a good-quality reference value function with $O(SA)$ space complexity. 
    more » « less