We propose a Bayesian decision-making framework for control of Markov Decision Processes (MDPs) with unknown dynamics and large, possibly continuous, state, action, and parameter spaces in data-poor environments. Most existing adaptive controllers for MDPs with unknown dynamics are based on the reinforcement learning framework and rely on large data sets acquired by sustained direct interaction with the system or via a simulator. This is not feasible in many applications, due to ethical, economic, and physical constraints. The proposed framework addresses the data-poverty issue by decomposing the problem into an offline planning stage that does not rely on sustained direct interaction with the system or a simulator, and an online execution stage. In the offline stage, parallel Gaussian process temporal difference (GPTD) learning techniques are employed for near-optimal Bayesian approximation of the expected discounted reward over a sample drawn from the prior distribution of the unknown parameters. In the online stage, the action with the maximum expected return with respect to the posterior distribution of the parameters is selected. This is achieved by approximating the posterior distribution using a Markov Chain Monte Carlo (MCMC) algorithm, followed by constructing multiple Gaussian processes over the parameter space for efficient prediction of the means of the expected return at the MCMC sample points. The effectiveness of the proposed framework is demonstrated using a simple dynamical system model with continuous state and action spaces, as well as a more complex model of a metastatic melanoma gene regulatory network observed through noisy synthetic gene expression data.
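A minimal sketch of this offline/online split, for a toy one-dimensional system, is given below (Python, numpy only). Everything in it is a hypothetical stand-in: a plain Monte Carlo return estimate plays the role of each parallel GPTD learner, and self-normalized importance weights over the prior sample replace the MCMC posterior approximation; the dynamics, reward, and constants are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_return(theta, a, gamma=0.95, horizon=50, n_rollouts=30):
    """Stand-in for a GPTD learner: Monte Carlo estimate of the expected
    discounted reward of repeatedly applying action a under parameter theta."""
    total = 0.0
    for _ in range(n_rollouts):
        x, disc, g = 0.0, 1.0, 0.0
        for _ in range(horizon):
            x = theta * x + a + rng.normal(0.0, 0.1)  # toy linear dynamics
            g += disc * (-(x ** 2))                   # quadratic reward
            disc *= gamma
        total += g / n_rollouts
    return total

actions = np.linspace(-1.0, 1.0, 5)

# Offline stage: approximate expected returns over a sample from the prior.
prior_thetas = rng.uniform(0.5, 1.0, size=20)
offline_values = {(i, j): mc_return(th, a)
                  for i, th in enumerate(prior_thetas)
                  for j, a in enumerate(actions)}

# Online stage: weight the prior sample by the likelihood of observed
# transitions (standing in for an MCMC posterior sample), then select the
# action with maximum posterior-expected return.
true_theta, applied_a = 0.8, 0.2
xs = [0.0]
for _ in range(30):                                   # synthetic observations
    xs.append(true_theta * xs[-1] + applied_a + rng.normal(0.0, 0.1))

def log_lik(theta):
    resid = np.diff(xs) - (theta * np.array(xs[:-1]) + applied_a)
    return -0.5 * np.sum(resid ** 2) / 0.1 ** 2

ll = np.array([log_lik(th) for th in prior_thetas])
w = np.exp(ll - ll.max())                             # self-normalized weights
w /= w.sum()

posterior_value = [sum(w[i] * offline_values[(i, j)]
                       for i in range(len(prior_thetas)))
                   for j in range(len(actions))]
print("selected action:", actions[int(np.argmax(posterior_value))])
```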
Stochastic Approximation for Multi-period Simulation Optimization with Streaming Input Data
We consider a continuous-valued simulation optimization (SO) problem, where a simulator is built to optimize an expected performance measure of a real-world system while parameters of the simulator are estimated from streaming data collected periodically from the system. At each period, a new batch of data is combined with the cumulative data and the parameters are re-estimated with higher precision. The system requires the decision variable to be selected in all periods. Therefore, it is sensible for the decision-maker to update the decision variable at each period by solving a more precise SO problem with the updated parameter estimate to reduce the performance loss with respect to the target system. We define this decision-making process as the multi-period SO problem and introduce a multi-period stochastic approximation (SA) framework that generates a sequence of solutions. Two algorithms are proposed: Re-start SA (ReSA) reinitializes the stepsize sequence in each period, whereas Warm-start SA (WaSA) carefully tunes the stepsizes, taking both fewer and shorter gradient-descent steps in later periods as parameter estimates become increasingly precise. We show that under suitable strong convexity and regularity conditions, ReSA and WaSA achieve the best possible convergence rate in expected sub-optimality when either an unbiased or a simultaneous perturbation gradient estimator is employed, while WaSA accrues significantly lower computational cost as the number of periods increases. In addition, we present the regularized ReSA, which obviates the need to know the strong convexity constant and achieves the same convergence rate at the expense of additional computation.
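The contrast between the two algorithms is easiest to see in their stepsize schedules. Below is a toy sketch (minimizing E[(x − m)²] while m is re-estimated from an accumulating data stream); the schedules are illustrative caricatures of the restart/warm-start idea, not the tuned schedules analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
true_m = 2.0  # unknown input parameter, estimated from streaming data

def noisy_grad(x, m_hat):
    """Unbiased stochastic gradient of f(x) = (x - m_hat)^2."""
    return 2.0 * (x - m_hat) + rng.normal(0.0, 0.5)

def run_periods(n_periods=10, batch=50, warm_start=False):
    data, x = [], 0.0
    for t in range(1, n_periods + 1):
        data.extend(rng.normal(true_m, 1.0, size=batch))  # new data batch
        m_hat = float(np.mean(data))                      # re-estimate input
        # WaSA: fewer and shorter steps in later periods, since m_hat is
        # increasingly precise; ReSA: restart the full schedule each period.
        n_steps = max(1, 100 // t) if warm_start else 100
        for k in range(1, n_steps + 1):
            step = 0.5 / (t * k) if warm_start else 0.5 / k
            x -= step * noisy_grad(x, m_hat)
    return x

print("ReSA :", run_periods(warm_start=False))
print("WaSA :", run_periods(warm_start=True))
```

Because WaSA's inner loop shrinks with t, its total gradient-evaluation count grows roughly logarithmically in the number of periods, whereas ReSA's grows linearly; that is the computational gap the abstract refers to.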
- Award ID(s): 2246281
- PAR ID: 10522751
- Publisher / Repository: ACM TOMACS
- Date Published:
- Journal Name: ACM Transactions on Modeling and Computer Simulation
- Volume: 34
- Issue: 2
- ISSN: 1049-3301
- Page Range / eLocation ID: 1 to 27
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- An influence diagram is a graphical model of a Bayesian decision problem that is solved by finding a strategy that maximizes expected utility. When an influence diagram is solved by variable elimination or a related dynamic programming algorithm, it is traditional to represent a strategy as a sequence of policies, one for each decision variable, where a policy maps the relevant history for a decision to an action. We propose an alternative representation of a strategy as a graph, called a strategy graph, and show how to modify a variable elimination algorithm so that it constructs a strategy graph. We consider both a classic variable elimination algorithm for influence diagrams and a recent extension of this algorithm that has more relaxed constraints on elimination order that allow improved performance. We consider the advantages of representing a strategy as a graph and, in particular, how to simplify a strategy graph so that it is easier to interpret and analyze.
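As a toy illustration of the two representations, the sketch below solves a two-outcome influence diagram (a forecast observation followed by an umbrella decision) by brute-force enumeration and then stores the optimal strategy as a small graph. The model, probabilities, and utilities are invented, and enumeration stands in for the variable elimination algorithm the paper modifies.

```python
# Toy influence diagram: chance node rain, observed forecast, decision D.
from itertools import product

P_rain = 0.3
P_forecast = {('rainy', True): 0.8, ('rainy', False): 0.1,   # P(forecast | rain)
              ('sunny', True): 0.2, ('sunny', False): 0.9}
utility = {(True, True): 70, (True, False): 0,               # (rain, umbrella)
           (False, True): 80, (False, False): 100}

def expected_utility(policy):
    """policy maps the observed forecast to an action (take umbrella or not)."""
    eu = 0.0
    for rain, forecast in product([True, False], ['rainy', 'sunny']):
        p = (P_rain if rain else 1 - P_rain) * P_forecast[(forecast, rain)]
        eu += p * utility[(rain, policy[forecast])]
    return eu

# Enumerate all policies (stand-in for variable elimination).
best = max((dict(zip(['rainy', 'sunny'], acts))
            for acts in product([True, False], repeat=2)),
           key=expected_utility)

# The same strategy as a graph: a decision node whose outgoing edges are
# labeled by observations and point to action leaves; identical actions
# can share a leaf, which is what makes the graph form compact.
strategy_graph = {'D': {forecast: ('take' if act else 'leave')
                        for forecast, act in best.items()}}
print(strategy_graph, expected_utility(best))
```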
- We consider the problem of inferring the conditional independence graph (CIG) of high-dimensional Gaussian vectors from multi-attribute data. Most existing methods for graph estimation are based on single-attribute models, where one associates a scalar random variable with each node. In multi-attribute graphical models, each node represents a random vector. In this paper we provide a unified theoretical analysis of multi-attribute graph learning using a penalized log-likelihood objective function. We consider both convex (sparse-group lasso) and non-convex (log-sum and SCAD group penalties) penalty/regularization functions. We establish sufficient conditions in a high-dimensional setting for consistency (convergence of the precision matrix to its true value in the Frobenius norm), local convexity when using non-convex penalties, and graph recovery. We do not impose any incoherence or irrepresentability condition for our convergence results.
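The group structure is the key point: with m attributes per node, edge (i, j) corresponds to an m × m block of the precision matrix that is kept or removed as a unit. The sketch below applies one proximal (group soft-thresholding) step for a plain sparse-group-lasso penalty to a random symmetric matrix; the full method alternates such steps with gradient steps on the penalized log-likelihood, and all dimensions and penalty weights here are illustrative.

```python
import numpy as np

def sg_lasso_prox(block, step, lam1, lam2):
    """Prox of lam1*||.||_1 + lam2*||.||_F applied to one off-diagonal block."""
    soft = np.sign(block) * np.maximum(np.abs(block) - step * lam1, 0.0)
    norm = np.linalg.norm(soft)
    if norm <= step * lam2:
        return np.zeros_like(soft)          # whole block (edge) removed
    return (1.0 - step * lam2 / norm) * soft

rng = np.random.default_rng(2)
p, m = 4, 3                                  # 4 nodes, 3 attributes each
Omega = rng.normal(size=(p * m, p * m)) * 0.1
Omega = (Omega + Omega.T) / 2                # precision matrices are symmetric

for i in range(p):
    for j in range(p):
        if i != j:
            blk = Omega[i*m:(i+1)*m, j*m:(j+1)*m]
            Omega[i*m:(i+1)*m, j*m:(j+1)*m] = sg_lasso_prox(blk, 1.0, 0.05, 0.3)

# Recovered conditional-independence graph: edge (i, j) iff block is nonzero.
edges = [(i, j) for i in range(p) for j in range(i + 1, p)
         if np.any(Omega[i*m:(i+1)*m, j*m:(j+1)*m])]
print(edges)
```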
- Estimating and quantifying uncertainty in system parameters remains a big challenge in applied and computational mathematics. A subset of these problems includes estimating periodic parameters that have unknown dynamics. Along with their time series, the period of these parameters may also be unknown and need to be estimated. The aim of this paper is to address the periodic parameter estimation problem, with particular focus on exploring the associated uncertainty, using Monte Carlo particle methods, such as the ensemble Kalman filter. Both parameter tracking and piecewise function approximations of periodic parameters are considered, highlighting aspects of parameter uncertainty in each approach when considering factors such as the frequency of available data and the number of piecewise segments used in the approximation. Estimation of the period of the periodic parameters and related uncertainty is also analyzed in the piecewise formulation. The pros and cons of each approach are discussed relative to a numerical example estimating the external voltage parameter in the FitzHugh-Nagumo system for modeling the spiking dynamics of neurons.
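A minimal sketch of the parameter-tracking variant is given below: the FitzHugh-Nagumo state (v, w) is augmented with the external voltage parameter I, propagated under an artificial random-walk assumption, and corrected by a stochastic (perturbed-observation) ensemble Kalman update whenever a noisy measurement of v arrives. Step sizes, noise levels, and the sinusoidal true I(t) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
dt, eps, a, b, obs_sd = 0.1, 0.08, 0.7, 0.8, 0.2

def step(z):
    """One Euler step of FitzHugh-Nagumo for augmented state z = (v, w, I)."""
    v, w, I = z
    return np.array([v + dt * (v - v**3 / 3 - w + I),
                     w + dt * eps * (v + a - b * w),
                     I + rng.normal(0.0, 0.02)])     # random-walk parameter

N, T = 100, 400                                      # ensemble size, steps
truth = np.array([0.0, 0.0, 0.5])
ens = rng.normal(0.0, 0.5, size=(N, 3))
I_est = []
for t in range(T):
    I_true = 0.5 + 0.3 * np.sin(2 * np.pi * t * dt / 10.0)  # periodic truth
    truth = step(np.array([truth[0], truth[1], I_true]))
    y = truth[0] + rng.normal(0.0, obs_sd)           # observe v only

    ens = np.array([step(z) for z in ens])           # forecast ensemble
    Hx = ens[:, 0]                                   # observed component
    cov_zy = ((ens - ens.mean(0)).T @ (Hx - Hx.mean())) / (N - 1)
    var_y = Hx.var(ddof=1) + obs_sd**2
    gain = cov_zy / var_y                            # Kalman gain, shape (3,)
    perturbed = y + rng.normal(0.0, obs_sd, size=N)  # perturbed observations
    ens = ens + np.outer(perturbed - Hx, gain)       # analysis update
    I_est.append(ens[:, 2].mean())

print("final I estimate:", I_est[-1])
```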
- arXiv:2401.07170v1. This paper considers online optimization for a system that performs a sequence of back-to-back tasks. Each task can be processed in one of multiple processing modes that affect the duration of the task, the reward earned, and an additional vector of penalties (such as energy or cost). Let A[k] be a random matrix of parameters that specifies the duration, reward, and penalty vector under each processing option for task k. The goal is to observe A[k] at the start of each new task k and then choose a processing mode for the task so that, over time, the time-average reward is maximized subject to time-average penalty constraints. This is a renewal optimization problem and is challenging because the probability distribution for the A[k] sequence is unknown. Prior work shows that any algorithm that comes within ϵ of optimality must have Ω(1/ϵ^2) convergence time. The only known algorithm that can meet this bound operates without time-average penalty constraints and uses a diminishing stepsize that cannot adapt when probabilities change. This paper develops a new algorithm that is adaptive and comes within O(ϵ) of optimality over any interval of Ω(1/ϵ^2) tasks during which the probabilities are held fixed, regardless of the probabilities before the start of the interval.
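For context, the classical non-adaptive device for this problem class is a drift-plus-penalty ratio rule with virtual queues enforcing the time-average constraints; the sketch below implements that baseline on synthetic A[k] data. The fixed trade-off parameter V, the queue update, and the random task distribution are illustrative; this is the standard Lyapunov-optimization recipe, not the adaptive algorithm contributed by the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n_modes, V = 3, 50.0
c = np.array([1.0])              # time-average penalty budget per unit time
Q = np.zeros(1)                  # virtual queue for the single constraint

tot_reward = tot_penalty = tot_time = 0.0
for k in range(10_000):
    # A[k]: per-mode (duration, reward, penalty); its distribution is
    # unknown to the controller, and random here for illustration.
    dur = rng.uniform(1.0, 3.0, n_modes)
    rew = rng.uniform(0.0, 5.0, n_modes)
    pen = rng.uniform(0.0, 3.0, n_modes)

    score = (V * rew - Q[0] * pen) / dur             # ratio rule over modes
    m = int(np.argmax(score))

    Q[0] = max(Q[0] + pen[m] - c[0] * dur[m], 0.0)   # virtual-queue update
    tot_reward += rew[m]; tot_penalty += pen[m]; tot_time += dur[m]

print("reward rate :", tot_reward / tot_time)
print("penalty rate:", tot_penalty / tot_time)       # should approach c[0]
```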