We study the convergence of several natural policy gradient (NPG) methods in infinite-horizon discounted Markov decision processes with regular policy parametrizations. For a variety of NPGs and reward functions we show that the trajectories in state-action space are solutions of gradient flows with respect to Hessian geometries, based on which we obtain global convergence guarantees and convergence rates. In particular, we show linear convergence for unregularized and regularized NPG flows with the metrics proposed by Kakade and Morimura and co-authors by observing that these arise from the Hessian geometries of conditional entropy and entropy respectively. Further, we obtain sublinear convergence rates for Hessian geometries arising from other convex functions like log-barriers. Finally, we interpret the discrete-time NPG methods with regularized rewards as inexact Newton methods if the NPG is defined with respect to the Hessian geometry of the regularizer. This yields local quadratic convergence rates of these methods for step size equal to the inverse penalization strength.
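As a rough illustration of the discrete-time setting, the sketch below runs an unregularized NPG iteration for a tabular softmax policy, where the update with Kakade's metric is commonly written as the multiplicative rule pi_new(a|s) proportional to pi(a|s) * exp(step * A(s,a) / (1 - gamma)). The two-state MDP, the step size, and all function names are assumptions made for this example and are not taken from the paper; the regularized and Hessian-geometry variants analyzed there are not implemented.

```python
import math

GAMMA = 0.9  # discount factor (illustrative choice)
# Hypothetical 2-state, 2-action MDP; every number here is made up.
P = [[[0.9, 0.1], [0.2, 0.8]],   # P[s][a][t] = Pr(next state t | state s, action a)
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 1.0],                 # R[s][a] = immediate reward
     [0.0, 2.0]]
STATES, ACTIONS = range(2), range(2)


def q_values(policy, sweeps=400):
    """Approximate Q^pi by repeated Bellman backups (a contraction since GAMMA < 1)."""
    q = [[0.0 for _ in ACTIONS] for _ in STATES]
    for _ in range(sweeps):
        v = [sum(policy[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
        q = [[R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in STATES)
              for a in ACTIONS] for s in STATES]
    return q


def state_values(policy):
    """State values V^pi(s) = sum_a pi(a|s) Q^pi(s, a)."""
    q = q_values(policy)
    return [sum(policy[s][a] * q[s][a] for a in ACTIONS) for s in STATES]


def npg_step(policy, step):
    """One multiplicative NPG update using the advantage A = Q - V."""
    q = q_values(policy)
    new_policy = []
    for s in STATES:
        v = sum(policy[s][a] * q[s][a] for a in ACTIONS)
        weights = [policy[s][a] * math.exp(step * (q[s][a] - v) / (1.0 - GAMMA))
                   for a in ACTIONS]
        total = sum(weights)
        new_policy.append([w / total for w in weights])
    return new_policy


policy = [[0.5, 0.5], [0.5, 0.5]]   # uniform initial policy
history = [state_values(policy)]
for _ in range(20):
    policy = npg_step(policy, step=0.1)
    history.append(state_values(policy))
```

Here `history` records the state values after each update, so the per-iteration improvement of the exact tabular update can be inspected directly.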
 NSF-PAR ID:
 10340929
 Date Published:
 Journal Name:
 Operations Research
 ISSN:
 0030-364X
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

Abstract 
The Gromov-Wasserstein (GW) formalism can be seen as a generalization of the optimal transport (OT) formalism for comparing two distributions associated with different metric spaces. It is a quadratic optimization problem, and solving it usually has computational costs that can rise sharply if the problem size exceeds a few hundred points. Recently, fast techniques based on entropy regularization have been developed to solve an approximation of the GW problem quickly. There are issues, however, with the numerical convergence of those regularized approximations to the true GW solution. To circumvent those issues, we introduce a novel strategy to solve the discrete GW problem using methods taken from statistical physics. We build a temperature-dependent free energy function that reflects the GW problem’s constraints. To account for possible differences of scales between the two metric spaces, we introduce a scaling factor s in the definition of the energy. From the extremum of the free energy, we derive a mapping between the two probability measures that are being compared, as well as a distance between those measures. This distance is equal to the GW distance when the temperature goes to zero. The optimal scaling factor itself is obtained by minimizing the free energy with respect to s. We illustrate our approach on the problem of comparing shapes defined by unstructured triangulations of their surfaces. We use several synthetic and “real life” datasets. We demonstrate the accuracy and automaticity of our approach in non-rigid registration of shapes. We provide numerical evidence that there is a strong correlation between the GW distances computed from low-resolution, surface-based representations of proteins and the analogous distances computed from atomistic models of the same proteins.

Abstract The softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For $\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space $|\mathcal{S}|$ and the effective horizon $\frac{1}{1-\gamma}$, both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize $\eta$ can take $$\frac{1}{\eta} |\mathcal{S}|^{2^{\Omega\big(\frac{1}{1-\gamma}\big)}} ~\text{iterations}$$ to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.
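For readers unfamiliar with the update being analyzed, here is a minimal sketch of exact softmax PG on a small tabular MDP. It uses the standard gradient expression for softmax parameters, dV(mu)/dtheta(s,a) = d(s) * pi(a|s) * A(s,a) / (1 - gamma), where d is the discounted state visitation distribution. The two-state MDP, the initial distribution `MU`, the stepsize, and the function names are assumptions for illustration; this is not the three-action construction from the paper and does not reproduce its lower bound.

```python
import math

GAMMA = 0.9          # discount factor (illustrative choice)
MU = [0.5, 0.5]      # initial state distribution (assumed uniform)
# Hypothetical 2-state, 2-action MDP; every number here is made up.
P = [[[0.9, 0.1], [0.2, 0.8]],   # P[s][a][t] = Pr(next state t | state s, action a)
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[0.0, 1.0],                 # R[s][a] = immediate reward
     [0.0, 2.0]]
STATES, ACTIONS = range(2), range(2)


def softmax(row):
    """Numerically stable softmax over one state's action preferences."""
    m = max(row)
    e = [math.exp(x - m) for x in row]
    z = sum(e)
    return [x / z for x in e]


def evaluate(policy, sweeps=300):
    """Return (Q^pi, d): action values and discounted state visitation from MU."""
    q = [[0.0 for _ in ACTIONS] for _ in STATES]
    d = list(MU)
    for _ in range(sweeps):
        v = [sum(policy[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
        q = [[R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in STATES)
              for a in ACTIONS] for s in STATES]
        # Fixed-point iteration for d = (1 - gamma) * mu + gamma * P_pi^T d.
        d = [(1.0 - GAMMA) * MU[t]
             + GAMMA * sum(d[s] * policy[s][a] * P[s][a][t]
                           for s in STATES for a in ACTIONS)
             for t in STATES]
    return q, d


def objective(theta):
    """Expected discounted return V^pi(MU) of the softmax policy induced by theta."""
    policy = [softmax(row) for row in theta]
    q, _ = evaluate(policy)
    return sum(MU[s] * sum(policy[s][a] * q[s][a] for a in ACTIONS) for s in STATES)


def pg_step(theta, eta):
    """One exact gradient ascent step on the softmax logits theta."""
    policy = [softmax(row) for row in theta]
    q, d = evaluate(policy)
    new_theta = []
    for s in STATES:
        v = sum(policy[s][a] * q[s][a] for a in ACTIONS)
        new_theta.append([theta[s][a]
                          + eta * d[s] * policy[s][a] * (q[s][a] - v) / (1.0 - GAMMA)
                          for a in ACTIONS])
    return new_theta


theta = [[0.0, 0.0], [0.0, 0.0]]   # uniform initial policy
start_value = objective(theta)
for _ in range(100):
    theta = pg_step(theta, eta=0.01)
final_value = objective(theta)
```

Note that the factor pi(a|s) in the gradient shrinks as an action's probability approaches zero, which is one informal way to see why plain softmax PG can stall for long stretches.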
Abstract Purpose The constrained one‐step spectral CT image reconstruction (cOSSCIR) algorithm with a nonconvex alternating direction method of multipliers optimizer is proposed for addressing computed tomography (CT) metal artifacts caused by beam hardening, noise, and photon starvation. The quantitative performance of cOSSCIR is investigated through a series of photon‐counting CT simulations.
Methods cOSSCIR directly estimates basis material maps from photon‐counting data using a physics‐based forward model that accounts for beam hardening. The cOSSCIR optimization framework places constraints on the basis maps, which we hypothesize will stabilize the decomposition and reduce streaks caused by noise and photon starvation. Another advantage of cOSSCIR is that the spectral data need not be registered, so that a ray can be used even if some energy window measurements are unavailable. Photon‐counting CT acquisitions of a virtual pelvic phantom with low‐contrast soft tissue texture and bilateral hip prostheses were simulated. Bone and water basis maps were estimated using the cOSSCIR algorithm and combined to form a virtual monoenergetic image for the evaluation of metal artifacts. The cOSSCIR images were compared to a “two‐step” decomposition approach that first estimated basis sinograms using a maximum likelihood algorithm and then reconstructed basis maps using an iterative total variation constrained least‐squares optimization (MLE+TV). Images were also compared to a nonspectral TV reconstruction of the total number of counts detected for each ray with and without normalized metal artifact reduction (NMAR) applied. The simulated metal density was increased to investigate the effects of increasing photon starvation. The quantitative error and standard deviation in regions of the phantom were compared across the investigated algorithms. The ability of cOSSCIR to reproduce the soft‐tissue texture, while reducing metal artifacts, was quantitatively evaluated.
Results Noiseless simulations demonstrated the convergence of the cOSSCIR and MLE+TV algorithms to the correct basis maps in the presence of beam‐hardening effects. When noise was simulated, cOSSCIR demonstrated a quantitative error of −1 HU, compared to 2 HU error for the MLE+TV algorithm and −154 HU error for the nonspectral TV+NMAR algorithm. For the cOSSCIR algorithm, the standard deviation in the central iodine region of interest was 20 HU, compared to 299 HU for the MLE+TV algorithm, 41 HU for the MLE+TV+Mask algorithm that excluded rays through metal, and 55 HU for the nonspectral TV+NMAR algorithm. Increasing levels of photon starvation did not impact the bias or standard deviation of the cOSSCIR images. cOSSCIR was able to reproduce the soft‐tissue texture when an appropriate regularization constraint value was selected.
Conclusions By directly inverting photon‐counting CT data into basis maps using an accurate physics‐based forward model and a constrained optimization algorithm, cOSSCIR avoids metal artifacts due to beam hardening, noise, and photon starvation. The cOSSCIR algorithm demonstrated improved stability and accuracy compared to a two‐step method of decomposition followed by reconstruction.

Integrating regularization methods with standard loss functions such as the least squares, hinge loss, etc., within a regression framework has become a popular choice for researchers to learn predictive models with lower variance and better generalization ability. Regularizers also aid in building interpretable models with high-dimensional data which makes them very appealing. It is observed that each regularizer is uniquely formulated in order to capture data-specific properties such as correlation, structured sparsity and temporal smoothness. The problem of obtaining a consensus among such diverse regularizers while learning a predictive model is extremely important in order to determine the optimal regularizer for the problem. The advantage of such an approach is that it preserves the simplicity of the final model learned by selecting a single candidate model which is not the case with ensemble methods as they use multiple candidate models for prediction. This is called the consensus regularization problem which has not received much attention in the literature due to the inherent difficulty associated with learning and selecting a model from an integrated regularization framework. To solve this problem, in this paper, we propose a method to generate a committee of non-convex regularized linear regression models, and use a consensus criterion to determine the optimal model for prediction. Each corresponding non-convex optimization problem in the committee is solved efficiently using the cyclic coordinate descent algorithm with the generalized thresholding operator. Our Consensus RegularIzation Selection based Prediction (CRISP) model is evaluated on electronic health records (EHRs) obtained from a large hospital for the congestive heart failure readmission prediction problem. We also evaluate our model on high-dimensional synthetic datasets to assess its performance. The results indicate that CRISP outperforms several state-of-the-art methods such as additive, interactions-based and other competing non-convex regularized linear regression methods.
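The cyclic coordinate descent step mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it uses the convex L1 (lasso) penalty with the soft-thresholding operator as a stand-in, since the abstract does not specify the generalized thresholding operator or the non-convex penalties used in CRISP. The function names, the `threshold` hook, and the toy data are hypothetical.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator for the L1 penalty (a stand-in, see lead-in)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def cyclic_cd(X, y, lam, threshold=soft_threshold, sweeps=100):
    """Minimize (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    `threshold` is a hypothetical hook; the division by z_j after thresholding
    is the exact coordinate update for the L1 penalty only.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)                      # residual y - X beta, with beta = 0
    for _ in range(sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            z_j = sum(c * c for c in col) / n
            if z_j == 0.0:
                continue                 # all-zero column: leave coefficient at 0
            # Correlation of column j with the partial residual (beta_j removed).
            rho = sum(col[i] * (resid[i] + col[i] * beta[j]) for i in range(n)) / n
            new_bj = threshold(rho, lam) / z_j
            delta = new_bj - beta[j]
            if delta != 0.0:
                resid = [resid[i] - col[i] * delta for i in range(n)]
            beta[j] = new_bj
    return beta


# Toy data with orthogonal columns, so each coordinate has a closed-form solution.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
beta = cyclic_cd(X, y, lam=0.1)
```

On this toy design the first coefficient is shrunk but kept, while the second falls below the threshold and is set exactly to zero, which is the sparsity-inducing behavior that thresholding operators provide.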