

Search for: All records

Award ID contains: 2238979


  1. Larochelle, Hugo; Murray, Naila; Kamath, Gautam; Shah, Nihar B (Ed.)
    Gaussian Mixture Models (GMMs) have recently been proposed for approximating actors in actor-critic reinforcement learning algorithms. Such GMM-based actors are commonly optimized using stochastic policy gradients along with an entropy maximization objective. In contrast to previous work, we define and study deterministic policy gradients for optimizing GMM-based actors. Similar to stochastic gradient approaches, our proposed method, denoted Gaussian Mixture Deterministic Policy Gradient (Gamid-PG), encourages policy entropy maximization. To this end, we define the GMM entropy gradient using a variational approximation of the KL-divergence between the GMM's component Gaussians. We compare Gamid-PG with common stochastic policy gradient methods on benchmark dense-reward MuJoCo tasks and sparse-reward Fetch tasks. We observe that Gamid-PG outperforms stochastic gradient-based methods in 3/6 MuJoCo tasks while performing similarly on the remaining 3 tasks. In the Fetch tasks, Gamid-PG outperforms single-actor deterministic gradient-based methods while performing worse than stochastic policy gradient methods. Consequently, we conclude that GMMs optimized using deterministic policy gradients (1) should be favorably considered over stochastic gradients in dense-reward continuous control tasks, and (2) improve upon single-actor deterministic gradients.
    Free, publicly-accessible full text available December 1, 2025
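
The entropy term in this record rests on a variational approximation of the KL divergence between the mixture's component Gaussians. As a hedged illustration only (not the authors' code), the sketch below computes one common closed-form variational approximation of the KL divergence between two diagonal-covariance Gaussian mixtures, in the style of Hershey and Olsen; the diagonal-covariance simplification and all names are assumptions, and the paper's entropy-gradient construction itself is not reproduced.

```python
# Sketch (not the paper's code): a variational approximation of the KL
# divergence between two diagonal-covariance Gaussian mixtures.
import numpy as np

def kl_diag_gaussians(mu0, var0, mu1, var1):
    """Closed-form KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) )."""
    return 0.5 * np.sum(var0 / var1 + (mu1 - mu0) ** 2 / var1
                        - 1.0 + np.log(var1 / var0))

def variational_kl_gmm(w_f, mu_f, var_f, w_g, mu_g, var_g):
    """Variational approximation of KL(f || g) for two GMMs f and g."""
    A, B = len(w_f), len(w_g)
    kl_ff = np.array([[kl_diag_gaussians(mu_f[a], var_f[a], mu_f[b], var_f[b])
                       for b in range(A)] for a in range(A)])
    kl_fg = np.array([[kl_diag_gaussians(mu_f[a], var_f[a], mu_g[b], var_g[b])
                       for b in range(B)] for a in range(A)])
    num = np.log(np.sum(w_f * np.exp(-kl_ff), axis=1))   # within-f term
    den = np.log(np.sum(w_g * np.exp(-kl_fg), axis=1))   # cross-term vs g
    return float(np.sum(w_f * (num - den)))

# Toy usage: two 3-component GMMs over a 2-D action space.
rng = np.random.default_rng(0)
w = np.ones(3) / 3
mu_f, mu_g = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))
var_f, var_g = np.ones((3, 2)), 0.5 * np.ones((3, 2))
print(variational_kl_gmm(w, mu_f, var_f, w, mu_g, var_g))
```

An entropy estimate for a GMM actor could be built from the same pairwise component KLs, as the abstract describes, but that construction belongs to the paper.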
  2. This paper investigates methods for training parameterized functions that guide state-space search algorithms. Existing work commonly generates data for training such guiding functions by solving problem instances while leveraging the current version of the guiding function. As a result, as training progresses, the guided search algorithm can solve more difficult instances that are, in turn, used to further train the guiding function. These methods assume that a set of problem instances of varied difficulty is provided. Since previous work was not designed to distinguish the instances the search algorithm can solve with the current guiding function from those it cannot, the algorithm commonly wastes time attempting and failing to solve many of these instances. In this paper, we improve upon these training methods by generating a curriculum for learning the guiding function that directly addresses this issue. Namely, we propose and evaluate a Teacher-Student Curriculum (TSC) approach where the teacher is an evolutionary strategy that attempts to generate problem instances of "correct difficulty" and the student is a guided search algorithm utilizing the current guiding function. The student attempts to solve the problem instances generated by the teacher. We conclude with experiments demonstrating that TSC outperforms the current state-of-the-art Bootstrap Learning method across three representative benchmark domains and three guided search algorithms, with respect to the time required to solve all instances of the test set.
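
To make the Teacher-Student Curriculum loop in the record above concrete, here is a minimal toy sketch. It assumes an invented scalar "difficulty" parameter and a stand-in solve model in place of a real guided search algorithm; only the overall teacher/student interaction mirrors the abstract, not the paper's method.

```python
# Toy Teacher-Student Curriculum loop (not the paper's code). The "teacher"
# is a (1+lambda)-style evolutionary strategy nudging instance difficulty
# toward a target solve rate; the "student" is a stand-in solver whose
# skill grows as it solves instances.
import random

random.seed(0)

def student_solves(skill, difficulty):
    """Hypothetical solve model: easier instances are solved more often."""
    return random.random() < max(0.0, min(1.0, 1.0 - (difficulty - skill)))

skill, difficulty, target_rate, lam = 0.1, 0.1, 0.5, 8

for generation in range(20):
    # Teacher: propose lam mutated instance difficulties.
    candidates = [max(0.0, difficulty + random.gauss(0, 0.1)) for _ in range(lam)]
    scored = []
    for cand in candidates:
        solves = sum(student_solves(skill, cand) for _ in range(20)) / 20
        scored.append((abs(solves - target_rate), cand, solves))
    # Keep the candidate whose solve rate is closest to the target.
    _, difficulty, rate = min(scored)
    # Student: "train" on instances it solved (toy skill update).
    skill += 0.05 * rate
    print(f"gen={generation:2d} difficulty={difficulty:.2f} "
          f"solve_rate={rate:.2f} skill={skill:.2f}")
```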
  3. Comprehensive state-action exploration is essential for reinforcement learning (RL) algorithms. It enables them to find optimal solutions and avoid premature convergence. In value-based RL, optimistic initialization of the value function ensures sufficient exploration for finding the optimal solution. Optimistic values lead to curiosity-driven exploration, enabling visitation of under-explored regions. However, optimistic initialization has limitations in stochastic and non-stationary environments due to its inability to explore "infinitely often". To address this limitation, we propose a novel exploration strategy for value-based RL, denoted COIN, based on recurring optimistic initialization. By injecting a continual exploration bonus, we overcome the shortcoming of optimistic initialization (sensitivity to environment noise). We provide a rigorous theoretical comparison of COIN versus existing popular exploration strategies and prove that it provides a unique set of attributes (coverage, infinitely-often exploration, no visitation tracking, and curiosity). We demonstrate the superiority of COIN over popular existing strategies on a designed toy domain as well as present results on common benchmark tasks. We observe that COIN outperforms existing exploration strategies in four out of six benchmark tasks while performing on par with the best baseline on the other two tasks.
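
For the record above, the sketch below shows tabular Q-learning on a toy chain where optimism is re-injected periodically rather than set only once at initialization. The environment, the bonus schedule, and its magnitude are assumptions for illustration; the actual COIN bonus and its theoretical guarantees are not reproduced here.

```python
# Illustrative only: recurring (rather than one-time) optimism in tabular
# Q-learning on a 1-D chain with a sparse reward at the right end.
import numpy as np

n_states, n_actions, goal = 8, 2, 7           # actions: 0 = left, 1 = right
Q = np.full((n_states, n_actions), 1.0)       # optimistic initial values
alpha, gamma, boost_every, boost = 0.2, 0.95, 500, 0.05

state = 0
for step in range(10_000):
    action = int(np.argmax(Q[state]))                     # greedy w.r.t. Q
    next_state = min(state + 1, goal) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == goal else 0.0
    Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state])
                                 - Q[state, action])
    state = 0 if next_state == goal else next_state       # episode resets at goal
    if (step + 1) % boost_every == 0:
        Q += boost                                        # recurring optimism
print(np.round(Q, 2))
```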
  4. The A* algorithm is commonly used to solve NP-hard combinatorial optimization problems. When provided with a completely informed heuristic function, A* can solve such problems in time complexity that is polynomial in the solution cost and branching factor. In light of this fact, we examine a line of recent publications that propose fitting deep neural networks to the completely informed heuristic function. We assert that these works suffer from inherent scalability limitations since, under the assumption that NP ⊄ P/poly, such approaches result in either (a) network sizes that scale super-polynomially in the instance sizes or (b) heuristic accuracy that scales inversely with the instance sizes. Complementing our theoretical claims, we provide experimental results for three representative NP-hard search problems. The results suggest that fitting deep neural networks to informed heuristic functions requires network sizes that grow quickly with the problem instance size. We conclude by suggesting that the research community should focus on scalable methods for integrating heuristic search with machine learning, as opposed to methods relying on informed heuristic estimation.
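
The argument in the record above concerns A* guided by a (possibly learned) heuristic. Below is a generic A* sketch with a pluggable heuristic callable on a toy grid; a fitted neural network would simply replace the hand-coded `manhattan` stand-in. The domain is invented for illustration and is not one of the paper's benchmarks.

```python
# Generic A* with a pluggable heuristic; a learned model could replace
# `manhattan` below. Toy 2-D grid domain.
import heapq

def astar(start, goal, passable, heuristic):
    frontier = [(heuristic(start, goal), 0, start)]
    g = {start: 0}
    parents = {start: None}
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if node == goal:                       # reconstruct path on success
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if passable(nxt) and cost + 1 < g.get(nxt, float("inf")):
                g[nxt] = cost + 1
                parents[nxt] = node
                heapq.heappush(frontier, (cost + 1 + heuristic(nxt, goal),
                                          cost + 1, nxt))
    return None

def manhattan(a, b):                           # stand-in for a learned heuristic
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

walls = {(1, 1), (1, 2), (1, 3)}
inside = lambda p: 0 <= p[0] < 5 and 0 <= p[1] < 5 and p not in walls
print(astar((0, 0), (4, 4), inside, manhattan))
```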
  5. To maximize indoor daylight, design projects commonly use commercial optimization tools to find optimum window configurations. However, experiments show that such tools either fail to find the optimal solution or are very slow to compute under certain conditions. This paper presents a comparative analysis between a gradient-free optimization technique, Covariance Matrix Adaptation Evolution Strategy (CMA-ES), and the widely used Genetic Algorithm (GA)-based tool, Galapagos, to optimize window parameters to improve indoor daylight in six locations across different latitudes. A novel combination of the daylight metrics sDA and ASE is proposed for the single-objective optimization comparison. Results indicate that GA in Galapagos takes progressively more time to converge, from 11 minutes at the southernmost latitude to 11 hours at the northernmost, while the runtime for CMA-ES is consistently around 2 hours. On average, CMA-ES is 1.5 times faster than Galapagos while consistently producing optimal solutions. This paper can help researchers select appropriate optimization algorithms for daylight simulation based on latitude, runtime, and solution quality.
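
For readers who want to try CMA-ES on a window-parameter search, a minimal ask/tell loop with the `cma` Python package might look like the sketch below. The package choice and the `daylight_score` surrogate are assumptions; in the paper's setting the objective would come from a daylight simulation returning the combined sDA/ASE metric rather than a closed-form function.

```python
# Sketch of a CMA-ES loop over window parameters using the `cma` package
# (an assumption, not the paper's tooling). `daylight_score` is a
# hypothetical stand-in for a simulator-derived sDA/ASE objective.
import cma

def daylight_score(params):
    # Hypothetical smooth surrogate: reward window-to-wall ratio, sill height,
    # and shading depth near fictional "good" values; CMA-ES minimizes, so negate.
    wwr, sill, depth = params
    return -(1.0 - (wwr - 0.45) ** 2 - (sill - 0.9) ** 2 - (depth - 0.3) ** 2)

es = cma.CMAEvolutionStrategy([0.3, 0.5, 0.5], 0.2,
                              {"bounds": [0.0, 1.5], "seed": 1})
while not es.stop():
    candidates = es.ask()                       # sample window configurations
    es.tell(candidates, [daylight_score(c) for c in candidates])
es.result_pretty()                              # best parameters and score
```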
  6. Applying reinforcement learning (RL) to sparse-reward domains is notoriously challenging due to insufficient guiding signals. Common RL techniques for addressing such domains include (1) learning from demonstrations and (2) curriculum learning. While these two approaches have been studied in detail, they have rarely been considered together. This paper aims to do so by introducing a principled task-phasing approach that uses demonstrations to automatically generate a curriculum sequence. Using inverse RL from (suboptimal) demonstrations, we define a simple initial task. Our task-phasing approach then provides a framework to gradually increase the complexity of the task all the way to the target task, while retuning the RL agent in each phasing iteration. Two approaches for phasing are considered: (1) gradually increasing the proportion of time steps an RL agent is in control, and (2) phasing out a guiding informative reward function. We present conditions that guarantee the convergence of these approaches to an optimal policy. Experimental results on three sparse-reward domains demonstrate that our task-phasing approaches outperform state-of-the-art approaches with respect to asymptotic performance.
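
A toy sketch of the first phasing variant from the record above (gradually increasing the proportion of time steps the RL agent is in control) follows; the environment, the demonstration policy, the stand-in agent, and the phasing schedule are all invented for illustration.

```python
# Toy sketch: the share of time steps under the RL agent's control grows each
# phase, the rest follow a demonstration policy. Not the paper's code.
import random

random.seed(0)

def demo_policy(state):            # suboptimal demonstrator: usually goes right
    return 1 if random.random() < 0.8 else 0

def run_episode(agent_policy, control_prob, length=30, goal=10):
    state = 0
    for _ in range(length):
        if random.random() < control_prob:      # agent in control this step
            action = agent_policy(state)
        else:                                   # otherwise follow the demo
            action = demo_policy(state)
        state += 1 if action == 1 else -1
        if state >= goal:
            return 1.0                          # sparse reward at the goal
    return 0.0

agent_policy = lambda state: random.choice([0, 1])   # untrained stand-in agent
for phase, control_prob in enumerate([0.0, 0.25, 0.5, 0.75, 1.0]):
    returns = [run_episode(agent_policy, control_prob) for _ in range(200)]
    print(f"phase {phase}: agent control {control_prob:.0%}, "
          f"mean return {sum(returns) / len(returns):.2f}")
    # In the paper's setting the agent would be retuned between phases;
    # this sketch keeps the stand-in agent fixed to show only the mixing.
```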
  7. We address a mechanism design problem where the goal of the designer is to maximize the entropy of a player's mixed strategy at a Nash equilibrium. This objective is of special relevance to video games, where game designers wish to diversify the players' interaction with the game. To solve this design problem, we propose a bi-level alternating optimization technique that (1) approximates the mixed-strategy Nash equilibrium using a Nash Monte-Carlo reinforcement learning approach and (2) applies a gradient-free optimization technique (Covariance Matrix Adaptation Evolution Strategy) to maximize the entropy of the mixed strategy obtained in level (1). The experimental results show that our approach achieves results comparable to the state-of-the-art approach on three benchmark domains: "Rock-Paper-Scissors-Fire-Water", "Workshop Warfare", and "Pokemon Video Game Championship". Next, we show that, unlike previous state-of-the-art approaches, the computational complexity of our proposed approach scales significantly better in larger combinatorial strategy spaces.
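
To illustrate the bi-level structure described in the record above, the sketch below approximates the equilibrium of a small parameterized zero-sum game with regret matching (a stand-in for the Nash Monte-Carlo RL component) and searches over the game's payoff weights with simple random search (a stand-in for CMA-ES) to maximize the entropy of the resulting mixed strategy; the game and all parameters are invented.

```python
# Bi-level sketch (not the paper's code): inner level approximates a mixed
# Nash equilibrium via regret matching; outer level searches the designer's
# payoff weights to maximize the entropy of that mixed strategy.
import numpy as np

rng = np.random.default_rng(0)

def payoff_matrix(theta):
    a, b, c = theta                      # generalized rock-paper-scissors weights
    return np.array([[0, -c, a], [c, 0, -b], [-a, b, 0]], dtype=float)

def equilibrium_strategy(M, iters=5000):
    """Self-play regret matching; returns the average (approximate NE) policy."""
    regrets = np.zeros(3)
    cumulative = np.zeros(3)
    for _ in range(iters):
        positive = np.maximum(regrets, 0.0)
        policy = positive / positive.sum() if positive.sum() > 0 else np.ones(3) / 3
        cumulative += policy
        utility = M @ policy             # expected payoff of each pure action
        regrets += utility - policy @ utility
    return cumulative / cumulative.sum()

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

best_theta, best_entropy = None, -np.inf
for _ in range(200):                     # outer "designer" loop (random search)
    theta = rng.uniform(0.5, 2.0, size=3)
    h = entropy(equilibrium_strategy(payoff_matrix(theta)))
    if h > best_entropy:
        best_theta, best_entropy = theta, h
print("best payoff weights:", np.round(best_theta, 2),
      "entropy:", round(best_entropy, 3))
```

In this toy, balanced payoff weights yield a uniform equilibrium and hence maximum entropy, so the outer search should drift toward roughly equal weights; the paper's combinatorial strategy spaces are far larger, which is where the scaling comparison in the abstract applies.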