This letter studies how a stochastic gradient algorithm (SG) can be controlled to hide the estimate of the local stationary point from an eavesdropper. Such problems are of significant interest in distributed optimization settings like federated learning and inventory management. A learner queries a stochastic oracle and incentivizes the oracle to obtain noisy gradient measurements and perform SG. The oracle probabilistically returns either a noisy gradient of the function or a non-informative measurement, depending on the oracle state and incentive. The learner’s query and incentive are visible to an eavesdropper who wishes to estimate the stationary point. This letter formulates the problem of the learner performing covert optimization by dynamically incentivizing the stochastic oracle and obfuscating the eavesdropper as a finite-horizon Markov decision process (MDP). Using conditions for interval dominance on the cost and transition probability structure, we show that the optimal policy for the MDP has a monotone threshold structure. We propose searching for the optimal stationary policy with the threshold structure using a stochastic approximation algorithm and a multi-armed bandit approach. The effectiveness of our methods is numerically demonstrated on a covert federated learning hate-speech classification task.
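As a rough illustration of the policy-search step, the sketch below tunes a single monotone threshold for a simulated learner-oracle interaction using simultaneous perturbation stochastic approximation (SPSA). The dynamics, cost terms, and every name and constant are invented placeholders (the eavesdropper-obfuscation cost is omitted entirely); this is not the letter's actual model.

```python
# Minimal sketch, assuming a toy learner-oracle model: the learner queries
# (and pays an incentive) only when the oracle state is above a threshold,
# and SPSA tunes that threshold. The eavesdropper/obfuscation cost is omitted.
import numpy as np

rng = np.random.default_rng(0)
HORIZON = 50          # finite horizon of the MDP (placeholder)
ORACLE_STATES = 5     # oracle "willingness" levels; higher = more informative

def simulate_cost(threshold, n_runs=100):
    """Average cost of the threshold policy: query the oracle only when its
    state is at or above `threshold`, otherwise send an obfuscating query."""
    total = 0.0
    for _ in range(n_runs):
        state, useful, cost = rng.integers(ORACLE_STATES), 0, 0.0
        for _ in range(HORIZON):
            if state >= threshold:
                cost += 1.0                                   # incentive paid
                if rng.random() < (state + 1) / ORACLE_STATES:
                    useful += 1                               # informative gradient
            # oracle state drifts as a simple birth-death chain
            state = int(np.clip(state + rng.choice([-1, 0, 1]),
                                0, ORACLE_STATES - 1))
        cost += 2.0 * max(0, 20 - useful)   # penalty for too few useful gradients
        total += cost
    return total / n_runs

def spsa_search(n_iter=60, a=0.5, c=1.0):
    """Simultaneous-perturbation stochastic approximation over the threshold."""
    theta = ORACLE_STATES / 2.0
    for k in range(1, n_iter + 1):
        ak, ck = a / k, c / k ** 0.25
        delta = rng.choice([-1.0, 1.0])
        grad = (simulate_cost(theta + ck * delta)
                - simulate_cost(theta - ck * delta)) / (2 * ck * delta)
        theta = float(np.clip(theta - ak * grad, 0, ORACLE_STATES - 1))
    return theta

print("estimated threshold:", round(spsa_search(), 2))
```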
Looking Upstream: Optimal Policies for a Class of Capacitated Multi-Stage Inventory Systems
We consider a multi-stage inventory system with stochastic demand and processing capacity constraints at each stage, for both finite-horizon and infinite-horizon, discounted-cost settings. For a class of such systems characterized by having the smallest capacity at the most downstream stage and system utilization above a certain threshold, we identify the structure of the optimal policy, which represents a novel variation of the order-up-to policy. We find the explicit functional form of the optimal order-up-to levels, and show that they depend (only) on upstream echelon inventories. We establish that, above the threshold utilization, this optimal policy achieves the decomposition of the multidimensional objective cost function for the system into a sum of single-dimensional convex functions. This decomposition eliminates the curse of dimensionality and allows us to numerically solve the problem. We provide a fast algorithm to determine a (tight) upper bound on this threshold utilization for capacity-constrained inventory problems with an arbitrary number of stages. We make use of this algorithm to quantify upper bounds on the threshold utilization for three-, four-, and five-stage capacitated systems over a range of model parameters, and discuss insights that emerge.
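A minimal sketch of the flavor of policy described above, with made-up numbers: each stage orders up to a base-stock level expressed in echelon inventory, clipped by its processing capacity. The function name and all data are illustrative assumptions; the paper derives the exact form and upstream dependence of the order-up-to levels.

```python
# Hedged sketch: a capacity-clipped echelon order-up-to rule.
# Base-stock levels and data are placeholders, not the paper's derived values.
from typing import List

def order_quantities(echelon_inv: List[float],
                     base_stock: List[float],
                     capacity: List[float]) -> List[float]:
    """Each stage orders up to its echelon base-stock level, but never more
    than its processing capacity allows in the period."""
    orders = []
    for inv, s, cap in zip(echelon_inv, base_stock, capacity):
        raw = max(0.0, s - inv)          # uncapacitated order-up-to quantity
        orders.append(min(raw, cap))     # processing-capacity constraint
    return orders

# Example: stage 0 is the most downstream and has the smallest capacity,
# matching the class of systems studied above.
print(order_quantities(echelon_inv=[12.0, 30.0, 55.0],
                       base_stock=[20.0, 45.0, 70.0],
                       capacity=[8.0, 15.0, 20.0]))
```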
- Award ID(s): 1644935
- PAR ID: 10041616
- Date Published:
- Journal Name: Production and Operations Management
- ISSN: 1059-1478
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finite-horizon episodic Markov decision process with $$S$$ states, $$A$$ actions, and horizon length $$H$$, substantial progress has been achieved toward characterizing the minimax-optimal regret, which scales on the order of $$\sqrt{H^2SAT}$$ (modulo log factors), where $$T$$ is the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient or fall short of optimality unless the sample size exceeds an enormous threshold (e.g. $$S^6A^4 \,\mathrm{poly}(H)$$ for existing model-free methods). To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity $$O(SAH)$$, that achieves near-optimal regret as soon as the sample size exceeds the order of $$SA\,\mathrm{poly}(H)$$. In terms of this sample size requirement (also referred to as the initial burn-in cost), our method improves upon any prior memory-efficient algorithm that is asymptotically regret-optimal by at least a factor of $$S^5A^3$$. Leveraging the recently introduced variance reduction strategy (also called reference-advantage decomposition), the proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration–exploitation trade-offs.
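The sketch below illustrates, under simplified assumptions, the kind of update described above: two Q-learning sequences maintained with upper and lower confidence bounds, plus a reference value that is settled early once the two bounds are close. The step size, bonus form, settling tolerance, and all names are placeholders rather than the paper's algorithm.

```python
# Hedged sketch: UCB/LCB Q-learning with an "early-settled" reference value.
# Bonus constant, step size, and settling rule are simplified assumptions.
import numpy as np

S, A, H = 5, 3, 10          # toy problem sizes
BONUS_C = 1.0               # confidence-bonus scaling (placeholder)

q_ucb = np.full((H, S, A), float(H))   # optimistic Q estimates
q_lcb = np.zeros((H, S, A))            # pessimistic Q estimates
q_ref = np.full((H, S), float(H))      # reference values; frozen once settled
visits = np.zeros((H, S, A), dtype=int)

def update(h, s, a, r, s_next):
    """Update both confidence-bound Q-sequences after observing (s, a, r, s')."""
    visits[h, s, a] += 1
    n = visits[h, s, a]
    lr = (H + 1) / (H + n)                      # horizon-rescaled step size
    bonus = BONUS_C * np.sqrt(H ** 3 / n)       # Hoeffding-style bonus (simplified)
    v_up = q_ucb[h + 1, s_next].max() if h + 1 < H else 0.0
    v_lo = q_lcb[h + 1, s_next].max() if h + 1 < H else 0.0
    q_ucb[h, s, a] = (1 - lr) * q_ucb[h, s, a] + lr * (r + v_up + bonus)
    q_lcb[h, s, a] = (1 - lr) * q_lcb[h, s, a] + lr * (r + v_lo - bonus)
    # Early settling: freeze the reference the first time the bounds are close,
    # so later variance-reduced updates can use a fixed, low-noise baseline.
    if q_ref[h, s] >= H and q_ucb[h, s].max() - q_lcb[h, s].max() <= 1.0:
        q_ref[h, s] = q_ucb[h, s].max()

update(h=0, s=2, a=1, r=1.0, s_next=3)   # example call on one observed transition
```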
-
We study matching queues with abandonment. The simplest of these is the two-sided queue with servers on one side and customers on the other, both arriving dynamically over time and abandoning if not matched by the time their patience elapses. We identify nonasymptotic and universal scaling laws for the matching loss due to abandonment, which we refer to as the “cost of impatience.” The scaling laws characterize the way in which this cost depends on the arrival rates and the (possibly different) mean patience of servers and customers. Our characterization reveals four operating regimes identified by an operational measure of patience that brings together mean patience and utilization. The four regimes subsume the regimes that arise in asymptotic (heavy-traffic) approximations. The scaling laws, specialized to each regime, reveal the fundamental structure of the cost of impatience and show that its order of magnitude is fully determined by (i) a “winner-take-all” competition between customer impatience and utilization, and (ii) the ability to accumulate inventory on the server side. Practically important is that when servers are impatient, the cost of impatience is, up to an order of magnitude, given by an insightful expression where only the minimum of the two patience rates appears. Considering the trade-off between abandonment and capacity costs, we characterize the scaling of the optimal safety capacity as a function of costs, arrival rates, and patience parameters. We prove that the ability to hold inventory of servers means that the optimal safety capacity grows logarithmically in abandonment cost and, in turn, slower than the square-root growth in the single-sided queue. This paper was accepted by Baris Ata, stochastic models and simulation. Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2023.01513.
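For intuition, the toy birth-death simulation below (with invented rates) estimates the matching loss due to abandonment empirically; the scaling laws in the abstract characterize this quantity analytically.

```python
# Hedged sketch: a toy two-sided matching queue with abandonment on both sides.
# A signed queue length tracks waiting customers (q > 0) or waiting servers
# (q < 0); matches happen instantly whenever both sides are present.
# All rates are arbitrary illustrative values.
import random

def cost_of_impatience(lam_c=1.0, lam_s=1.1, theta_c=0.5, theta_s=0.2,
                       t_end=50_000.0, seed=1):
    random.seed(seed)
    q, t, abandoned = 0, 0.0, 0
    while t < t_end:
        rate_ab = theta_c * max(q, 0) + theta_s * max(-q, 0)
        total = lam_c + lam_s + rate_ab
        t += random.expovariate(total)
        u = random.random() * total
        if u < lam_c:
            q += 1                        # customer arrival
        elif u < lam_c + lam_s:
            q -= 1                        # server arrival
        else:
            abandoned += 1                # an impatient customer or server leaves
            q += -1 if q > 0 else 1
    return abandoned / t                  # abandonments per unit time

print("estimated abandonment rate:", round(cost_of_impatience(), 4))
```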
-
We study the problem of reinforcement learning (RL) with low (policy) switching cost, a problem well-motivated by real-life RL applications in which deployments of new policies are costly and the number of policy updates must be low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of $$\widetilde{O}(\sqrt{H^4S^2AT})$$ while requiring a switching cost of $$O(HSA \log\log T)$$. This is an exponential improvement over the best-known switching cost $$O(H^2SA\log T)$$ among existing methods with $$\widetilde{O}(\mathrm{poly}(H,S,A)\sqrt{T})$$ regret. In the above, $$S$$ and $$A$$ denote the numbers of states and actions in an $$H$$-horizon episodic Markov decision process model with unknown transitions, and $$T$$ is the number of steps. As a byproduct of our new techniques, we also derive a reward-free exploration algorithm with a switching cost of $$O(HSA)$$. Furthermore, we prove a pair of information-theoretic lower bounds which show that (1) any no-regret algorithm must have a switching cost of $$\Omega(HSA)$$, and (2) any $$\widetilde{O}(\sqrt{T})$$ regret algorithm must incur a switching cost of $$\Omega(HSA\log\log T)$$. Both our algorithms are thus optimal in their switching costs.
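For intuition on the $$O(\log\log T)$$ switching cost, the sketch below shows a stage schedule whose cumulative sample count is (roughly) squared at each boundary, so the number of stages, and hence of possible policy switches, grows doubly-logarithmically in $$T$$. It illustrates only the scheduling arithmetic, not the adaptive policy-elimination algorithm, and the constants are arbitrary.

```python
# Hedged sketch: a "square the cumulative count" stage schedule, which yields
# O(log log T) stage boundaries (and hence policy switches) over T steps.
import math

def stage_boundaries(T, first=2):
    """Steps at which a new stage (and a possible policy switch) begins."""
    boundaries, t = [], first
    while t < T:
        boundaries.append(t)
        t = t * t            # square the cumulative count at each boundary
    return boundaries

T = 10**8
b = stage_boundaries(T)
print(b)                                             # [2, 4, 16, 256, 65536]
print("stages:", len(b), " log log T ~", round(math.log(math.log(T)), 2))
```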
-
We consider the problem of offline reinforcement learning (RL), a well-motivated setting of RL that aims at policy optimization using only historical data. Despite its wide applicability, theoretical understanding of offline RL, such as its optimal sample complexity, remains largely open even in basic settings such as tabular Markov decision processes (MDPs). In this paper, we propose Off-Policy Double Variance Reduction (OPDVR), a new variance-reduction-based algorithm for offline RL. Our main result shows that OPDVR provably identifies an $$\epsilon$$-optimal policy with $$\widetilde{O}(H^2/d_m\epsilon^2)$$ episodes of offline data in the finite-horizon stationary transition setting, where $$H$$ is the horizon length and $$d_m$$ is the minimal marginal state-action distribution induced by the behavior policy. This improves over the best known upper bound by a factor of $$H$$. Moreover, we establish an information-theoretic lower bound of $$\Omega(H^2/d_m\epsilon^2)$$, which certifies that OPDVR is optimal up to logarithmic factors. Lastly, we show that OPDVR also achieves rate-optimal sample complexity under alternative settings such as finite-horizon MDPs with non-stationary transitions and infinite-horizon MDPs with discounted rewards.
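The sketch below illustrates the general reference-based variance-reduction idea in a tabular offline setting, under assumed data layout and names: a coarse reference value is estimated on one split of the offline data, and only a small correction term on the other. It is an illustration of the technique family, not OPDVR itself.

```python
# Hedged sketch: reference-based variance reduction for an offline Q backup.
# The backup E[r + gamma * V_cur(s')] is split into a reference part estimated
# on one data split and a low-magnitude correction estimated on the other.
# Data layout, names, and values are illustrative assumptions.
import numpy as np
from collections import defaultdict

def vr_backup(data_ref, data_cur, v_ref, v_cur, gamma=0.99):
    """data_* are lists of (s, a, r, s_next) transitions from the behavior policy."""
    ref_term, corr_term = defaultdict(list), defaultdict(list)
    for s, a, _r, s2 in data_ref:
        ref_term[(s, a)].append(gamma * v_ref[s2])                     # reference part
    for s, a, r, s2 in data_cur:
        corr_term[(s, a)].append(r + gamma * (v_cur[s2] - v_ref[s2]))  # correction
    q = {}
    for key in set(ref_term) & set(corr_term):
        q[key] = float(np.mean(ref_term[key]) + np.mean(corr_term[key]))
    return q

# Tiny usage example with made-up values.
v_ref = {0: 1.0, 1: 0.5}
v_cur = {0: 1.1, 1: 0.6}
data_ref = [(0, 0, 1.0, 1), (0, 0, 0.0, 0)]
data_cur = [(0, 0, 1.0, 1), (0, 0, 0.5, 0)]
print(vr_backup(data_ref, data_cur, v_ref, v_cur))
```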