In many real-world settings, we face the problem of selecting the best among a finite number of alternatives, where the best alternative is determined by context-specific information. In this work, we study the contextual Ranking and Selection problem under a finite-alternative-finite-context setting, where we aim to find the best alternative for each context. We use a separate Gaussian process to model the reward for each alternative and derive the large deviations rate function for both the expected and worst-case contextual probability of correct selection. We propose the GP-C-OCBA sampling policy, which uses the Gaussian process posterior to iteratively allocate observations to maximize the rate function. We prove its consistency and show that it achieves the optimal convergence rate under the assumption of a non-informative prior. Numerical experiments show that our algorithm is highly competitive in terms of sampling efficiency, while incurring significantly smaller computational overhead.
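As a rough illustration of the modeling setup in this abstract, the sketch below fits a separate Gaussian process to each alternative over a small finite context set and reads off the posterior best alternative for each context. It assumes scikit-learn is available; the reward functions, kernel, and noise level are illustrative, and selecting by posterior mean is only a stand-in for exposition, not the GP-C-OCBA allocation rule itself.

```python
# Illustrative sketch: one GP per alternative over a 1-D context, posterior best-alternative map.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
contexts = np.linspace(0.0, 1.0, 10).reshape(-1, 1)   # finite context set
true_reward = [lambda c: np.sin(3 * c),               # hypothetical reward curves
               lambda c: 0.8 * c,
               lambda c: 0.5 - 0.2 * c]

# Fit a separate GP to a few noisy observations per (alternative, context) pair
models = []
for f in true_reward:
    X = np.repeat(contexts, 3, axis=0)
    y = f(X).ravel() + 0.1 * rng.standard_normal(len(X))
    gp = GaussianProcessRegressor(RBF(0.3) + WhiteKernel(0.01), normalize_y=True).fit(X, y)
    models.append(gp)

# Posterior estimate of the best alternative for every context
means = np.column_stack([gp.predict(contexts) for gp in models])
print(means.argmax(axis=1))
```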
Information theory for ranking and selection
Abstract: We study the classical ranking and selection problem, where the ultimate goal is to find the unknown best alternative in terms of the probability of correct selection or expected opportunity cost. However, this paper adopts an alternative sampling approach to achieve this goal, where sampling decisions are made with the objective of maximizing information about the unknown best alternative, or equivalently, minimizing its Shannon entropy. This adaptive learning is formulated as a Bayesian stochastic dynamic program, from which several properties of the learning problem are derived, including the monotonicity of the optimal value function in an information-seeking setting. Since the state space of the stochastic dynamic program is unbounded in the Gaussian setting, a one-step look-ahead approach is used to develop a policy. The proposed policy seeks to maximize the one-step information gain about the unknown best alternative, and therefore it is called information gradient (IG). It is also proved that the IG policy is consistent: as the sampling budget grows to infinity, the IG policy finds the true best alternative almost surely. Finally, a computationally efficient approximation of the proposed policy, called approximated information gradient (AIG), is introduced, and its performance is tested in numerical experiments against recent benchmarks alongside several sensitivity analyses. Results show that AIG performs competitively against other algorithms from the literature.
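A minimal sketch of the one-step look-ahead idea, assuming independent Normal posteriors with a known sampling variance: the Shannon entropy of the posterior distribution of the best alternative's identity is estimated by Monte Carlo, and each alternative is scored by the expected entropy reduction from one additional simulated observation. This is a toy stand-in, not the paper's exact IG/AIG computation; the priors, noise level, and sample sizes below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def best_arm_entropy(mu, sigma, n_mc=4000):
    """Shannon entropy of the posterior distribution of the identity of the best alternative."""
    draws = rng.normal(mu, sigma, size=(n_mc, len(mu)))
    p = np.bincount(draws.argmax(axis=1), minlength=len(mu)) / n_mc
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def one_step_ig(mu, sigma, noise_sd, n_outer=200):
    """Expected entropy reduction from one extra observation of each alternative."""
    h_now = best_arm_entropy(mu, sigma)
    gains = np.zeros(len(mu))
    for i in range(len(mu)):
        h_next = 0.0
        for _ in range(n_outer):
            y = rng.normal(mu[i], np.sqrt(sigma[i] ** 2 + noise_sd ** 2))  # predictive draw
            prec = 1.0 / sigma[i] ** 2 + 1.0 / noise_sd ** 2               # conjugate Normal update
            mu_new, sd_new = mu.copy(), sigma.copy()
            mu_new[i] = (mu[i] / sigma[i] ** 2 + y / noise_sd ** 2) / prec
            sd_new[i] = np.sqrt(1.0 / prec)
            h_next += best_arm_entropy(mu_new, sd_new, n_mc=1000)
        gains[i] = h_now - h_next / n_outer
    return gains

mu = np.array([0.0, 0.2, 0.25])      # posterior means
sigma = np.array([0.5, 0.5, 0.5])    # posterior standard deviations
print(one_step_ig(mu, sigma, noise_sd=1.0).argmax())  # alternative to sample next
```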
- Award ID(s): 1651912
- PAR ID: 10457719
- Publisher / Repository: Wiley Blackwell (John Wiley & Sons)
- Date Published:
- Journal Name: Naval Research Logistics (NRL)
- Volume: 67
- Issue: 4
- ISSN: 0894-069X
- Page Range / eLocation ID: p. 239-253
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
A key and challenging step toward personalized/precision medicine is the ability to redesign dose-finding clinical trials. This work studies a problem of fully response-adaptive Bayesian design of phase II dose-finding clinical trials with patient information, where the decision maker seeks to identify the right dose for each patient type (often defined as an effective target dose for each group of patients) by minimizing the expected (over patient types) variance of the right dose. We formulate this problem as a stochastic dynamic program and exploit a few properties of this class of learning problems. Because the optimal solution is intractable, we propose an approximate policy by an adaptation of a one-step look-ahead framework. We show the optimality of the proposed policy for a setting with homogeneous patients and two doses and find its asymptotic rate of sampling. We adapt a number of commonly applied allocation policies in dose-finding clinical trials, such as posterior adaptive sampling, and test their performance against our proposed policy via extensive simulations with synthetic and real data. Our numerical analyses provide insights regarding the connection between the structure of the dose-response curve for each patient type and the performance of allocation policies. This paper provides a practical framework for the Food and Drug Administration and pharmaceutical companies to transition from the current phase II procedures to the era of personalized dose-finding clinical trials. Funding: This research is supported by the National Science Foundation [Grant 1651912]. Supplemental Material: The online appendices are available at https://doi.org/10.1287/serv.2022.0306.
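A hedged sketch of the one-step look-ahead allocation described above, under strong simplifying assumptions: independent Normal posteriors on the mean response of each (patient type, dose) pair, with the "right dose" for a type taken to be the dose whose mean response is closest to a fixed target level. The target level, priors, and noise below are illustrative; this is not the paper's model or policy.

```python
import numpy as np

rng = np.random.default_rng(5)
target = 0.6                                  # assumed desired efficacy level
doses = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
n_types, noise_sd = 3, 0.3

mu = np.full((n_types, len(doses)), 0.5)      # posterior means per (type, dose)
sd = np.full((n_types, len(doses)), 0.5)      # posterior standard deviations

def right_dose_variance(mu_t, sd_t, n_mc=2000):
    """Posterior variance (over dose values) of the dose closest to the target level."""
    draws = rng.normal(mu_t, sd_t, size=(n_mc, len(doses)))
    idx = np.abs(draws - target).argmin(axis=1)
    return doses[idx].var()

def next_allocation(n_look=20):
    """One-step look-ahead: pick the (type, dose) whose extra observation most reduces
    the expected posterior variance of that type's right dose."""
    best, best_gain = None, -np.inf
    for t in range(n_types):
        base = right_dose_variance(mu[t], sd[t])
        for d in range(len(doses)):
            gain = 0.0
            for _ in range(n_look):                            # predictive look-ahead draws
                y = rng.normal(mu[t, d], np.sqrt(sd[t, d] ** 2 + noise_sd ** 2))
                prec = 1 / sd[t, d] ** 2 + 1 / noise_sd ** 2   # conjugate Normal update
                mu_new, sd_new = mu[t].copy(), sd[t].copy()
                mu_new[d] = (mu[t, d] / sd[t, d] ** 2 + y / noise_sd ** 2) / prec
                sd_new[d] = np.sqrt(1 / prec)
                gain += base - right_dose_variance(mu_new, sd_new, n_mc=500)
            if gain / n_look > best_gain:
                best, best_gain = (t, d), gain / n_look
    return best

print(next_allocation())   # (patient type, dose) to sample next
```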
A learner aims to minimize a function f by repeatedly querying a distributed oracle that provides noisy gradient evaluations. At the same time, the learner seeks to hide arg min f from a malicious eavesdropper that observes the learner’s queries. This paper considers the problem of covert or learner-private optimization, where the learner has to dynamically choose between learning and obfuscation by exploiting the stochasticity. The problem of controlling the stochastic gradient algorithm for covert optimization is modeled as a Markov decision process, and we show that the dynamic programming operator has a supermodular structure implying that the optimal policy has a monotone threshold structure. A computationally efficient policy gradient algorithm is proposed to search for the optimal querying policy without knowledge of the transition probabilities. As a practical application, our methods are demonstrated on a hate speech classification task in a federated setting where an eavesdropper can use the optimal weights to generate toxic content, which is more easily misclassified. Numerical results show that when the learner uses the optimal policy, an eavesdropper can only achieve a validation accuracy of 52% with no information and 69% when it has a public dataset with 10% positive samples, compared to 83% when the learner employs a greedy policy.
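The monotone threshold structure can be illustrated with a toy stochastic gradient loop in which the learner sends a learning query while its last observed gradient is large and switches to decoy queries once it is small. The state proxy, threshold, objective, and decoy distribution below are illustrative assumptions, not the paper's MDP or its optimal policy.

```python
import numpy as np

rng = np.random.default_rng(4)
f_grad = lambda x: 2.0 * (x - 3.0) + 0.1 * rng.standard_normal()  # noisy oracle for f(x) = (x - 3)^2

def covert_sgd(budget=200, threshold=0.05, step=0.1):
    x, last_grad = 0.0, float("inf")
    queries = []                          # the query points the eavesdropper observes
    for _ in range(budget):
        if abs(last_grad) > threshold:    # far from the optimum: send a learning query
            q = x
            last_grad = f_grad(q)
            x -= step * last_grad
        else:                             # near the optimum: send a decoy query to hide arg min f
            q = rng.uniform(-10.0, 10.0)
            f_grad(q)                     # oracle response is discarded
        queries.append(q)
    return x, queries

x_final, observed = covert_sgd()
print(round(x_final, 3), len(observed))
```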
This paper considers policy search in continuous state-action reinforcement learning problems. Typically, one computes search directions using a classic expression for the policy gradient called the Policy Gradient Theorem, which decomposes the gradient of the value function into two factors: the score function and the Q-function. This paper presents four results: (i) an alternative policy gradient theorem using weak (measure-valued) derivatives instead of the score function is established; (ii) the stochastic gradient estimates thus derived are shown to be unbiased and to yield algorithms that converge almost surely to stationary points of the non-convex value function of the reinforcement learning problem; (iii) the sample complexity of the algorithm is derived and is shown to be O(1/ k); (iv) finally, the expected variance of the gradient estimates obtained using weak derivatives is shown to be lower than those obtained using the popular score-function approach. Experiments on the OpenAI Gym pendulum environment illustrate the superior performance of the proposed algorithm.
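Result (i) can be illustrated outside the RL setting with a scalar Gaussian: for X ~ N(θ, σ²), the weak (measure-valued) derivative of E[h(X)] with respect to θ can be written as c·(E[h(θ + σR)] − E[h(θ − σR)]) with R ~ Rayleigh(1) and c = 1/(σ√(2π)). The toy comparison below contrasts this estimator with the score-function estimator for an arbitrary test function; it is only an illustration of the two gradient estimators, not the paper's policy gradient algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
h = lambda x: np.maximum(x, 0.0) ** 2          # arbitrary test function
theta, sigma, n = 0.5, 1.0, 200_000

# Score-function (likelihood-ratio) estimator of d/dtheta E[h(X)], X ~ N(theta, sigma^2)
x = rng.normal(theta, sigma, n)
score_est = h(x) * (x - theta) / sigma**2

# Weak-derivative estimator: positive/negative parts are theta +/- sigma*R, R ~ Rayleigh(1)
r = rng.rayleigh(1.0, n)
c = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
wd_est = c * (h(theta + sigma * r) - h(theta - sigma * r))

# Both are unbiased; the standard errors indicate their relative variance on this example
print("score-function :", score_est.mean(), "+/-", score_est.std() / np.sqrt(n))
print("weak derivative:", wd_est.mean(), "+/-", wd_est.std() / np.sqrt(n))
```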
How can we collect the most useful labels to learn a model selection policy, when presented with arbitrary heterogeneous data streams? In this paper, we formulate this task as a contextual active model selection problem, where at each round the learner receives an unlabeled data point along with a context. The goal is to output the best model for any given context without obtaining an excessive number of labels. In particular, we focus on the task of selecting pre-trained classifiers, and propose a contextual active model selection algorithm (CAMS), which relies on a novel uncertainty sampling query criterion defined on a given policy class for adaptive model selection. In comparison to prior art, our algorithm does not assume a globally optimal model. We provide rigorous theoretical analysis for the regret and query complexity under both adversarial and stochastic settings. Our experiments on several benchmark classification datasets demonstrate the algorithm’s effectiveness in terms of both regret and query complexity. Notably, to achieve the same accuracy, CAMS incurs less than 10% of the label cost when compared to the best online model selection baselines on CIFAR10.
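A minimal sketch of disagreement-driven label querying for online selection among pre-trained classifiers, loosely inspired by the setup above; the weighting scheme, query rule, and threshold are illustrative assumptions, and this is not the CAMS algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)

def select_and_query(models, stream, query_threshold=0.3):
    """models: list of callables x -> predicted label; stream: iterable of (x, y) pairs."""
    weights = np.ones(len(models))                 # scores over the pre-trained models
    labels_used, correct = 0, 0
    for x, y in stream:
        preds = np.array([m(x) for m in models])
        p = weights / weights.sum()
        y_hat = preds[p.argmax()]                  # follow the currently highest-weighted model
        correct += int(y_hat == y)
        # query a label only when the weighted models disagree enough
        top_vote = max(p[preds == v].sum() for v in set(preds))
        if 1.0 - top_vote > query_threshold:
            labels_used += 1
            weights *= np.exp(-1.0 * (preds != y))  # penalize models that got the label wrong
    return correct, labels_used

# Illustrative usage with synthetic classifiers and a synthetic stream
models = [lambda x: int(x > 0), lambda x: int(x > 0.5), lambda x: 1]
stream = [(x, int(x > 0.3)) for x in rng.uniform(-1, 1, 500)]
print(select_and_query(models, stream))            # (correct predictions, labels queried)
```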