Synchronizing decisions across multiple agents in realistic settings is problematic since it requires agents to wait for other agents' actions to terminate and to communicate reliably about termination. Ideally, agents should learn and execute asynchronously instead. Such asynchronous methods also allow temporally extended actions whose durations vary with the situation and the action executed. Unfortunately, current policy gradient methods are not applicable in asynchronous settings, as they assume that agents synchronously reason about action selection at every time step. To allow asynchronous learning and decision-making, we formulate a set of asynchronous multi-agent actor-critic methods that allow agents to directly optimize asynchronous policies in three standard training paradigms: decentralized learning, centralized learning, and centralized training for decentralized execution. Empirical results (in simulation and on hardware) in a variety of realistic domains demonstrate the superiority of our approaches in large multi-agent problems and validate the effectiveness of our algorithms for learning high-quality asynchronous solutions.
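As a rough illustration of the asynchronous execution model described above, here is a minimal sketch in which each agent re-selects a temporally extended action only when its own action terminates, rather than all agents deciding synchronously at every time step. The `env` and `policy` interfaces (`macro_action_done`, `sample`) are hypothetical placeholders, not the paper's implementation.

```python
# Minimal sketch of asynchronous action selection (hypothetical interfaces):
# each agent re-decides only when its own temporally extended action has
# terminated; agents never wait on one another at a synchronization barrier.
def run_episode(env, policies, max_steps=200):
    obs = env.reset()                      # per-agent observations
    current = {i: None for i in policies}  # each agent's ongoing action
    for _ in range(max_steps):
        for i, policy in policies.items():
            # Only agents whose current action has finished choose a new one.
            if current[i] is None or env.macro_action_done(i):
                current[i] = policy.sample(obs[i])
        # The environment advances one primitive step while the agents'
        # actions keep executing asynchronously.
        obs, rewards, done, info = env.step(current)
        if done:
            break
    return obs
```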
Macro-Action-Based Deep Multi-Agent Reinforcement Learning
In real-world multi-robot systems, performing high-quality, collaborative behaviors requires robots to asynchronously reason about high-level action selection at varying time durations. Macro-Action Decentralized Partially Observable Markov Decision Processes (MacDec-POMDPs) provide a general framework for asynchronous decision making under uncertainty in fully cooperative multi-agent tasks. However, multi-agent deep reinforcement learning methods have only been developed for (synchronous) primitive-action problems. This paper proposes two Deep Q-Network (DQN) based methods for learning decentralized and centralized macro-action-value functions, with novel macro-action trajectory replay buffers introduced for each case. Evaluations on benchmark problems and a larger domain demonstrate the advantage of learning with macro-actions over primitive actions and the scalability of our approaches.
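The abstract above refers to macro-action trajectory replay buffers. Below is a minimal per-agent sketch, under stated assumptions and not necessarily the paper's exact buffer design: primitive-step rewards are accumulated while a macro-action is executing, and a single transition spanning the whole macro-action is stored when it terminates.

```python
# Minimal sketch (illustrative, not the paper's exact design) of a per-agent
# macro-action replay buffer: rewards from every primitive step are summed
# while a macro-action runs, and one transition is stored at termination.
from collections import deque
import random

class MacroActionReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)
        self._obs, self._macro, self._reward = None, None, 0.0

    def begin_macro_action(self, obs, macro_action):
        # Record the observation and the macro-action chosen when it starts.
        self._obs, self._macro, self._reward = obs, macro_action, 0.0

    def step(self, reward):
        # Accumulate the reward of each primitive step under the macro-action.
        self._reward += reward

    def end_macro_action(self, next_obs, done):
        # Store one transition covering the macro-action's entire duration.
        self.buffer.append((self._obs, self._macro, self._reward, next_obs, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```

In training, `begin_macro_action`, `step`, and `end_macro_action` would be driven by the asynchronous execution loop, and minibatches sampled from the buffer would feed a DQN-style update on macro-action values.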
- Award ID(s): 1734497
- PAR ID: 10167549
- Date Published:
- Journal Name: Conference on Robot Learning
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Structural design synthesis considering discrete elements can be formulated as a sequential decision process solved using deep reinforcement learning, as shown in prior work. By modeling structural design synthesis as a Markov decision process (MDP), the states correspond to specific structural designs, the discrete actions correspond to specific design alterations, and the rewards are related to the improvement in the altered design’s performance with respect to the design objective and specified constraints. Here, the MDP action definition is extended by integrating parametric design grammars that further enable the design agent to not only alter a given structural design’s topology, but also its element parameters. In considering topological and parametric actions, both the dimensionality of the state and action space and the diversity of the action types available to the agent in each state significantly increase, making the overall MDP learning task more challenging. Hence, this paper also addresses discrete design synthesis problems with large state and action spaces by significantly extending the network architecture. Specifically, a hierarchical-inspired deep neural network architecture is developed to allow the agent to learn the type of action, topological or parametric, to apply, thus reducing the complexity of possible action choices in a given state. This extended framework is applied to the design synthesis of planar structures considering both discrete elements and cross-sectional areas, and it is observed to adeptly learn policies that synthesize high-performing design solutions.
- Learning safe solutions is an important but challenging problem in multi-agent reinforcement learning (MARL). Shielded reinforcement learning is one approach for preventing agents from choosing unsafe actions. Current shielded reinforcement learning methods for MARL make strong assumptions about communication and full observability. In this work, we extend the formalization of the shielded reinforcement learning problem to a decentralized multi-agent setting. We then present an algorithm for decomposition of a centralized shield, allowing shields to be used in such decentralized, communication-free environments. Our results show that agents equipped with decentralized shields perform comparably to agents with centralized shields in several tasks, allowing shielding to be used in environments with decentralized training and execution for the first time.
- Constrained action-based decision-making is one of the most challenging decision-making problems. It refers to a scenario where an agent takes action in an environment not only to maximize the expected cumulative reward but where it is subject to certain action-based constraints; for example, an upper limit on the total number of certain actions being carried out. In this work, we construct a general data-driven framework called Constrained Action-based Partially Observable Markov Decision Process (CAPOMDP) to induce effective pedagogical policies. Specifically, we induce two types of policies: CAPOMDP-LG, using learning gain as reward with the goal of improving students’ learning performance, and CAPOMDP-Time, using time as reward for reducing students’ time on task. The effectiveness of CAPOMDP-LG is compared against a random yet reasonable policy, and the effectiveness of CAPOMDP-Time is compared against both a Deep Reinforcement Learning induced policy and a random policy. Empirical results show that there is an Aptitude Treatment Interaction effect: students are split into High vs. Low based on their incoming competence; while no significant difference is found among the High incoming competence groups, for the Low groups, students following CAPOMDP-Time indeed spent significantly less time than those using the two baseline policies, and students following CAPOMDP-LG significantly outperform their peers on both learning gain and learning efficiency.
- Executing actions in a correlated manner is a common strategy for human coordination that often leads to better cooperation, which is also potentially beneficial for cooperative multi-agent reinforcement learning (MARL). However, the recent success of MARL relies heavily on the convenient paradigm of purely decentralized execution, where there is no action correlation among agents for scalability considerations. In this work, we introduce a Bayesian network to inaugurate correlations between agents’ action selections in their joint policy. Theoretically, we establish a theoretical justification for why action dependencies are beneficial by deriving the multi-agent policy gradient formula under such a Bayesian network joint policy and proving its global convergence to Nash equilibria under tabular softmax policy parameterization in cooperative Markov games. Further, by equipping existing MARL algorithms with a recent method of differentiable directed acyclic graphs (DAGs), we develop practical algorithms to learn the context-aware Bayesian network policies in scenarios with partial observability and varying difficulty. We also dynamically decrease the sparsity of the learned DAG throughout the training process, which leads to weakly or even purely independent policies for decentralized execution. Empirical results on a range of MARL benchmarks show the benefits of our approach.
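For the Bayesian-network joint policy in the last item above, the following minimal sketch (hypothetical `policies` and `parents` structures) samples a correlated joint action by visiting agents in a topological order of the DAG, with each agent conditioning on the actions already sampled by its parent agents.

```python
# Minimal sketch of sampling from a DAG-structured (Bayesian-network) joint
# policy: agents are visited in topological order, and each agent's policy
# conditions on its own observation plus its parents' already-chosen actions.
def sample_joint_action(policies, parents, topo_order, observations):
    actions = {}
    for agent in topo_order:
        parent_actions = {p: actions[p] for p in parents[agent]}
        actions[agent] = policies[agent].sample(observations[agent], parent_actions)
    return actions
```

An empty parent set recovers the standard fully decentralized case, which is consistent with the item's note that the learned DAG can be sparsified toward independent policies for decentralized execution.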