Multiagent teams have been shown to be effective in many domains that require coordination among team members. However, finding valuable joint-actions becomes increasingly difficult in tightly-coupled domains where each agent’s performance depends on the actions of many other agents. Reward shaping partially addresses this challenge by deriving more “tuned” rewards to provide agents with additional feedback, but this approach still relies on agents randomly discovering suitable joint-actions. In this work, we introduce Counterfactual Agent Suggestions (CAS) as a method for injecting knowledge into an agent’s learning process within the confines of existing reward structures. We show that CAS enables agent teams to converge towards desired behaviors more reliably. We also show that the improvement in team performance in the presence of suggestions extends to large teams and tightly-coupled domains.
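The abstract does not spell out the suggestion mechanism, so the sketch below only illustrates the underlying counterfactual idea in a tightly-coupled reward; all names (`counterfactual_suggestion_value`, `team_reward`) are hypothetical and not from the paper.

```python
def counterfactual_suggestion_value(team_reward, joint_action, agent_idx,
                                    suggested_action):
    """Score a suggestion for one agent as the counterfactual change in
    team reward if that agent alone had followed the suggestion while
    all teammates kept their actions fixed."""
    counterfactual = list(joint_action)
    counterfactual[agent_idx] = suggested_action
    return team_reward(counterfactual) - team_reward(joint_action)


# Toy tightly-coupled reward: the team scores only if every agent
# performs the coupled task simultaneously.
def team_reward(joint_action):
    return 1.0 if all(a == "observe" for a in joint_action) else 0.0


# Suggesting "observe" to the lone idle agent reveals a valuable
# joint-action that random exploration would rarely stumble on.
value = counterfactual_suggestion_value(
    team_reward, ["idle", "observe", "observe"], 0, "observe")
```

Under this toy reward, `value` is 1.0: the suggestion alone turns a zero-reward joint-action into the maximal one, which is exactly the kind of feedback that random discovery struggles to produce in tightly-coupled domains.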
Multi-level Fitness Critics for Cooperative Coevolution
In many multiagent domains, and particularly in tightly coupled domains, teasing apart an agent’s contribution to the system performance based on a single episodic return is difficult. This well-known difficulty hits state-to-action mapping approaches, such as neural networks trained by evolutionary algorithms, particularly hard. This paper introduces fitness critics, which leverage the expected fitness to evaluate an agent’s performance. This approach turns a sparse performance metric (policy evaluation) into a dense performance metric (state-action evaluation) by relating the episodic feedback to the state-action pairs experienced during the execution of that policy. In the tightly-coupled multi-rover domain (where multiple rovers must perform a particular task simultaneously), only teams using fitness critics demonstrated effective learning on tasks with tight coupling, while other coevolved teams were unable to learn at all.
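The core idea of relating episodic feedback to experienced state-action pairs can be sketched in tabular form. This is a simplified illustration, not the paper's implementation (which uses function approximators); the class name and interface are hypothetical.

```python
from collections import defaultdict


class FitnessCritic:
    """Tabular sketch of a fitness critic: maps each (state, action)
    pair to the running mean of the episodic fitness observed for
    trajectories that contained it."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def update(self, trajectory, episodic_fitness):
        # Relate the single episodic return to every state-action
        # pair experienced while executing the policy.
        for sa in trajectory:
            self.totals[sa] += episodic_fitness
            self.counts[sa] += 1

    def evaluate(self, state, action):
        # Dense feedback: the expected fitness of one state-action
        # pair, rather than a single sparse end-of-episode scalar.
        sa = (state, action)
        if self.counts[sa] == 0:
            return 0.0
        return self.totals[sa] / self.counts[sa]
```

After a few episodes, `evaluate` estimates the expected fitness of each visited state-action pair, turning the sparse policy-level signal into a dense state-action-level one.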
- Award ID(s): 1815886
- PAR ID: 10197808
- Date Published:
- Journal Name: AAMAS Conference proceedings
- ISSN: 2523-5699
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- How to select between policies and value functions produced by different training algorithms in offline reinforcement learning (RL)—which is crucial for hyperparameter tuning—is an important open question. Existing approaches based on off-policy evaluation (OPE) often require additional function approximation and hence hyperparameters, creating a chicken-and-egg situation. In this paper, we design hyperparameter-free algorithms for policy selection based on BVFT [XJ21], a recent theoretical advance in value-function selection, and demonstrate their effectiveness in discrete-action benchmarks such as Atari. To address performance degradation due to poor critics in continuous-action domains, we further combine BVFT with OPE to get the best of both worlds, and obtain a hyperparameter-tuning method for Q-function based OPE with theoretical guarantees as a side product.
- Silva, S.; Paquete, L. (Eds.) Coevolving teams of agents promises effective solutions for many coordination tasks such as search and rescue missions or deep ocean exploration. Good team performance in such domains generally relies on agents discovering complex joint policies, which is particularly difficult when the fitness functions are sparse (where many joint policies return the same or even zero fitness values). In this paper, we introduce Novelty Seeking Multiagent Evolutionary Reinforcement Learning (NS-MERL), which enables agents to more efficiently explore their joint strategy space. The key insight of NS-MERL is to promote good exploratory behaviors for individual agents using a dense, novelty-based fitness function. Though the overall team-level performance is still evaluated via a sparse fitness function, agents using NS-MERL more efficiently explore their joint action space and more readily discover good joint policies. Our results in complex coordination tasks show that teams of agents trained with NS-MERL perform significantly better than agents trained solely with task-specific fitnesses.
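A common form of novelty-based fitness (used in novelty search generally; the abstract does not give NS-MERL's exact formulation) scores a behavior by its mean distance to the k nearest neighbors in an archive of previously seen behaviors. The function name and behavior encoding below are illustrative.

```python
def novelty_score(behavior, archive, k=3):
    """Dense, novelty-based fitness sketch: mean Euclidean distance
    from an agent's behavior characterization (a tuple of floats) to
    its k nearest neighbors in an archive of past behaviors."""
    if not archive:
        return 0.0

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # Unlike a sparse task fitness, this signal varies smoothly with
    # how different the behavior is from what has been seen before.
    distances = sorted(dist(behavior, other) for other in archive)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)
```

A behavior close to the archive scores low while an unexplored one scores high, giving individual agents a dense gradient toward exploration even when the team-level fitness is flat.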
- In partially observable reinforcement learning, offline training gives access to latent information which is not available during online training and/or execution, such as the system state. Asymmetric actor-critic methods exploit such information by training a history-based policy via a state-based critic. However, many asymmetric methods lack theoretical foundation, and are only evaluated on limited domains. We examine the theory of asymmetric actor-critic methods which use state-based critics, and expose fundamental issues which undermine the validity of a common variant, and limit its ability to address partial observability. We propose an unbiased asymmetric actor-critic variant which is able to exploit state information while remaining theoretically sound, maintaining the validity of the policy gradient theorem, and introducing no bias and relatively low variance into the training process. An empirical evaluation performed on domains which exhibit significant partial observability confirms our analysis, demonstrating that unbiased asymmetric actor-critic converges to better policies and/or faster than symmetric and biased asymmetric baselines.
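The asymmetric structure (history-based actor, state-based critic) can be sketched with a tabular stand-in. All classes and interfaces here are hypothetical simplifications, not the paper's method, and this sketch does not address the bias issues the paper analyzes.

```python
class StateCritic:
    """Tabular state-value critic; conditions on the latent state
    that is available only during offline training."""
    def __init__(self):
        self.v = {}

    def value(self, state):
        return self.v.get(state, 0.0)

    def step(self, state, target, lr):
        self.v[state] = self.value(state) + lr * (target - self.value(state))


class HistoryPolicy:
    """Tabular action preferences conditioned on observation history,
    since the actor never sees the latent state at execution time."""
    def __init__(self):
        self.pref = {}

    def step(self, history, action, advantage, lr):
        key = (history, action)
        self.pref[key] = self.pref.get(key, 0.0) + lr * advantage


def asymmetric_update(policy, critic, transition, gamma=0.9, lr=0.1):
    """One asymmetric actor-critic step: the state-based critic
    supplies a TD advantage that updates the history-based actor."""
    history, state, action, reward, next_state = transition
    td_target = reward + gamma * critic.value(next_state)
    advantage = td_target - critic.value(state)
    policy.step(history, action, advantage, lr)
    critic.step(state, td_target, lr)
```

The point of the asymmetry is that the critic can use privileged state information for lower-variance value estimates, while the actor remains executable from observations alone.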
- During the 1990s and early 2000s, the affirmative action context in the United States changed. Affirmative action in higher education was banned in several states, and the Supreme Court ruled in Grutter (2003) that affirmative action, while constitutional, should be implemented via holistic evaluation of applicants. In this article, we use two datasets to examine how affirmative action context relates to academic outcomes at selective colleges and universities in the United States before and after the Grutter decision and in states with and without bans on affirmative action. Underrepresented minority students earned higher grades in the period after the Grutter decision than before it, indicating that the holistic evaluation method required by Grutter may enhance educational outcomes for these students. In contrast, we find no support for the idea, proposed by critics of the policy, that banning affirmative action leads to better collegiate outcomes for Black and Latino students at selective institutions.