In cooperative multi-agent reinforcement learning (MARL), due to its on-policy nature, policy gradient (PG) is typically believed to be less sample efficient than value decomposition (VD) methods, which are off-policy. However, some recent empirical studies demonstrate that with proper input representation and hyper-parameter tuning, multi-agent PG can achieve surprisingly strong performance compared to off-policy VD methods.
Why could PG methods work so well? In this post, we present concrete analysis to show that in certain scenarios, e.g., environments with a highly multi-modal reward landscape, VD can be problematic and lead to undesired outcomes. By contrast, PG methods with individual policies can converge to an optimal policy in these cases. Moreover, PG methods with auto-regressive (AR) policies can learn multi-modal policies.
Figure 1: different policy representations for the 4-player permutation game.
CTDE in Cooperative MARL: VD and PG approaches
Centralized training and decentralized execution (CTDE) is a popular framework in cooperative MARL. It leverages global information for more effective training while keeping the representation of individual policies for testing. CTDE can be implemented via value decomposition (VD) or policy gradient (PG), leading to two different types of algorithms.
VD methods learn local Q networks and a mixing function that mixes the local Q networks into a global Q function. The mixing function is usually enforced to satisfy the Individual-Global-Max (IGM) principle, which guarantees that the optimal joint action can be computed by greedily choosing the optimal action locally for each agent.
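To make the IGM principle concrete, below is a minimal sketch of the simplest VD instance, a VDN-style additive mixer in PyTorch; the shapes and names are our own illustration rather than code from the paper. Because summation is monotonically increasing in every local Q value, maximizing the mixed output reduces to each agent maximizing its own local Q function.

```python
import torch
import torch.nn as nn

class AdditiveMixer(nn.Module):
    """VDN-style mixer: Q_tot is the sum of the local Q values.

    Summation is monotonic in each local Q value, so the IGM principle
    holds: the argmax of Q_tot is the tuple of per-agent argmaxes.
    """
    def forward(self, chosen_local_qs: torch.Tensor) -> torch.Tensor:
        # chosen_local_qs: (batch, n_agents) Q values of the chosen actions
        return chosen_local_qs.sum(dim=-1, keepdim=True)

def greedy_joint_action(local_q_tables: torch.Tensor) -> torch.Tensor:
    # local_q_tables: (n_agents, n_actions); under IGM each agent can
    # simply take its own local argmax.
    return local_q_tables.argmax(dim=-1)
```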
By contrast, PG methods directly apply policy gradient to learn an individual policy and a centralized value function for each agent. The value function takes as its input the global state (e.g., MAPPO) or the concatenation of all the local observations (e.g., MADDPG), for an accurate global value estimate.
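Here is a minimal sketch of such a centralized critic; the architecture and dimensions are assumptions for illustration. Only the critic is centralized: each actor still conditions on its own local observation at execution time.

```python
import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    """Sketch of a centralized value function used by PG methods."""
    def __init__(self, input_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, central_input: torch.Tensor) -> torch.Tensor:
        # central_input is either the global state (MAPPO-style) or
        # torch.cat([obs_1, ..., obs_n], dim=-1) (MADDPG-style).
        return self.net(central_input)
```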
The permutation game: a simple counterexample where VD fails
We start our analysis by considering a stateless cooperative game, namely the permutation game. In an $N$-player permutation game, each agent can output $N$ actions $\{1,\ldots, N\}$. Agents receive $+1$ reward if their actions are mutually different, i.e., the joint action is a permutation over $1, \ldots, N$; otherwise, they receive $0$ reward. Note that there are $N!$ symmetric optimal strategies in this game.
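For concreteness, here is a tiny sketch of the permutation game's reward function (our own illustrative code, with actions encoded as integers):

```python
import numpy as np

def permutation_game_reward(joint_action: np.ndarray) -> float:
    """N-player permutation game: actions are integers in {1, ..., N}.

    Returns +1 if all agents choose mutually different actions
    (i.e., the joint action is a permutation of 1..N), and 0 otherwise.
    """
    n = len(joint_action)
    return 1.0 if len(set(joint_action.tolist())) == n else 0.0

print(permutation_game_reward(np.array([2, 4, 1, 3])))  # 1.0 (optimal)
print(permutation_game_reward(np.array([1, 1, 3, 4])))  # 0.0 (collision)
```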
Figure 2: the 4-player permutation game.
Figure 3: high-level intuition on why VD fails in the 2-player permutation game.
Let us now focus on the 2-player permutation game and apply VD to it. In this stateless setting, we use $Q_1$ and $Q_2$ to denote the local Q-functions, and use $Q_\textrm{tot}$ to denote the global Q-function. The IGM principle requires that
\[\arg\max_{a^1,a^2}Q_\textrm{tot}(a^1,a^2)=\{\arg\max_{a^1}Q_1(a^1),\arg\max_{a^2}Q_2(a^2)\}.\]
We prove by contradiction that VD cannot represent the payoff of the 2-player permutation game. If VD methods were able to represent the payoff, we would have
\[Q_\textrm{tot}(1, 2)=Q_\textrm{tot}(2,1)=1\quad \text{and}\quad Q_\textrm{tot}(1, 1)=Q_\textrm{tot}(2,2)=0.\]
If either of these two agents has different local Q values (e.g., $Q_1(1)> Q_1(2)$), we have $\arg\max_{a^1}Q_1(a^1)=1$. Then according to the IGM principle, any optimal joint action
\[(a^{1\star},a^{2\star})=\arg\max_{a^1,a^2}Q_\textrm{tot}(a^1,a^2)=\{\arg\max_{a^1}Q_1(a^1),\arg\max_{a^2}Q_2(a^2)\}\]
satisfies $a^{1\star}=1$ and $a^{1\star}\neq 2$, so the joint action $(a^1,a^2)=(2,1)$ is sub-optimal, i.e., $Q_\textrm{tot}(2,1)<1$.
Otherwise, if $Q_1(1)=Q_1(2)$ and $Q_2(1)=Q_2(2)$, then
\[Q_\textrm{tot}(1, 1)=Q_\textrm{tot}(2,2)=Q_\textrm{tot}(1, 2)=Q_\textrm{tot}(2,1).\]
As a result, value decomposition cannot represent the payoff matrix of the 2-player permutation game.
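As a quick numeric sanity check of this argument, the sketch below tries to fit the simplest VD instance, a VDN-style additive decomposition $Q_\textrm{tot}(a^1,a^2)=Q_1(a^1)+Q_2(a^2)$, to the payoff matrix by least squares and finds a non-zero approximation error. (The proof above is more general and covers any mixing function satisfying IGM; this snippet only illustrates the additive special case.)

```python
import numpy as np

# Payoff matrix of the 2-player permutation game (rows: a^1, cols: a^2).
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Least-squares fit of Q_tot(a1, a2) = Q_1(a1) + Q_2(a2) over the four
# joint actions; unknowns are [Q_1(1), Q_1(2), Q_2(1), Q_2(2)].
A, b = [], []
for a1 in range(2):
    for a2 in range(2):
        row = np.zeros(4)
        row[a1] = 1.0
        row[2 + a2] = 1.0
        A.append(row)
        b.append(R[a1, a2])
q = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)[0]

Q_fit = np.array([[q[0] + q[2], q[0] + q[3]],
                  [q[1] + q[2], q[1] + q[3]]])
print(Q_fit)                    # every entry of the best fit is 0.5
print(np.abs(Q_fit - R).max())  # 0.5: the payoff cannot be matched exactly
```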
What about PG methods? Individual policies can indeed represent an optimal policy for the permutation game. Moreover, under mild assumptions, stochastic gradient descent guarantees that PG converges to one of these optima. This suggests that, even though PG methods are less popular in MARL compared with VD methods, they can be preferable in certain cases that are common in real-world applications, e.g., games with multiple strategy modalities.
We also remark that in the permutation game, in order to represent an optimal joint policy, each agent must choose distinct actions. Consequently, a successful implementation of PG must ensure that the policies are agent-specific. This can be done by using either individual policies with unshared parameters (referred to as PG-Ind in our paper), or an agent-ID conditioned policy (PG-ID).
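Here is a minimal sketch of the two parameterizations (the names PG-Ind and PG-ID follow the paper; network sizes and the one-hot agent-ID encoding are our assumptions for illustration):

```python
import torch
import torch.nn as nn

n_agents, obs_dim, n_actions = 4, 8, 4

# PG-Ind: one policy network per agent, parameters are not shared.
ind_policies = nn.ModuleList([
    nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    for _ in range(n_agents)
])

# PG-ID: a single shared network conditioned on a one-hot agent ID,
# so different agents can still output different action distributions.
id_policy = nn.Sequential(
    nn.Linear(obs_dim + n_agents, 64), nn.ReLU(), nn.Linear(64, n_actions)
)

obs = torch.zeros(obs_dim)  # the permutation game is stateless; dummy input
for i in range(n_agents):
    logits_ind = ind_policies[i](obs)
    logits_id = id_policy(torch.cat([obs, torch.eye(n_agents)[i]]))
```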
PG outperforms existing VD methods on popular MARL testbeds
Going beyond the simple illustrative example of the permutation game, we extend our study to popular and more realistic MARL benchmarks. In addition to StarCraft Multi-Agent Challenge (SMAC), where the effectiveness of PG and agent-conditioned policy input has been verified, we show new results in Google Research Football (GRF) and multi-player Hanabi Challenge.
Figure 4: (left) winning rates of PG methods on GRF; (right) best and average evaluation scores on Hanabi-Full.
In GRF, PG methods outperform the state-of-the-art VD baseline (CDS) in 5 scenarios. Interestingly, we also notice that individual policies (PG-Ind) without parameter sharing achieve comparable, sometimes even higher, winning rates compared to agent-specific policies (PG-ID) in all 5 scenarios. We evaluate PG-ID in the full-scale Hanabi game with varying numbers of players (2-5) and compare it to SAD, a strong off-policy Q-learning variant in Hanabi, and Value Decomposition Networks (VDN). As demonstrated in the above table, PG-ID is able to produce results comparable to or better than the best and average rewards achieved by SAD and VDN with varying numbers of players, using the same number of environment steps.
Beyond higher rewards: learning multi-modal behavior via auto-regressive policy modeling
Besides learning higher rewards, we also study how to learn multi-modal policies in cooperative MARL. Let’s go back to the permutation game. Although we have proved that PG can effectively learn an optimal policy, the strategy mode that it finally reaches can highly depend on the policy initialization. Thus, a natural question will be:
Can we learn a single policy that can cover all the optimal modes?
In the decentralized PG formulation, the factorized representation of a joint policy can only represent one particular mode. Therefore, we propose an enhanced way to parameterize the policies for stronger expressiveness — the auto-regressive (AR) policies.
Figure 5: comparison between individual policies (PG) and auto-regressive policies (AR) in the 4-player permutation game.
Formally, we factorize the joint policy of $n$ agents into the form of
\[\pi(\mathbf{a} \mid \mathbf{o}) \approx \prod_{i=1}^n \pi_{\theta^i} \left( a^i\mid o^i,a^1,\ldots,a^{i-1} \right),\]
where the action produced by agent $i$ depends on its own observation $o^i$ and all the actions from previous agents $1,\ldots,i-1$. The auto-regressive factorization can represent any joint policy in a centralized MDP. The only modification to each agent’s policy is the input dimension, which is slightly enlarged by including previous actions; the output dimension of each agent’s policy remains unchanged.
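A minimal sketch of this auto-regressive parameterization is shown below (our own illustration: network sizes and the one-hot encoding of previous actions are assumptions); note that only the input dimension of agent $i$’s network grows with $i$, while the output dimension stays fixed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

n_agents, obs_dim, n_actions = 4, 8, 4

# Agent i additionally conditions on the (one-hot) actions of agents
# 1..i-1, so only its input dimension grows; the output stays n_actions.
ar_policies = nn.ModuleList([
    nn.Sequential(
        nn.Linear(obs_dim + i * n_actions, 64), nn.ReLU(),
        nn.Linear(64, n_actions),
    )
    for i in range(n_agents)
])

def sample_joint_action(observations: torch.Tensor) -> list:
    # observations: (n_agents, obs_dim); agents act sequentially.
    actions, prev_onehots = [], []
    for i in range(n_agents):
        inp = torch.cat([observations[i], *prev_onehots])
        dist = Categorical(logits=ar_policies[i](inp))
        a = dist.sample()
        actions.append(int(a))
        prev_onehots.append(F.one_hot(a, n_actions).float())
    return actions

print(sample_joint_action(torch.zeros(n_agents, obs_dim)))
```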
With such a minimal parameterization overhead, AR policy substantially improves the representation power of PG methods. We remark that PG with AR policy (PG-AR) can simultaneously represent all optimal policy modes in the permutation game.
Figure: the heatmaps of actions for policies learned by PG-Ind (left) and PG-AR (middle), and the heatmap of rewards (right); while PG-Ind only converges to a specific mode in the 4-player permutation game, PG-AR successfully discovers all the optimal modes.
In more complex environments, including SMAC and GRF, PG-AR can learn interesting emergent behaviors that require strong intra-agent coordination that may never be learned by PG-Ind.
Figure 6: emergent behaviors induced by PG-AR in SMAC and GRF. (left) On the 2m_vs_1z map of SMAC, the marines keep standing and attack alternately while ensuring there is only one attacking marine at each timestep; (right) in the academy_3_vs_1_with_keeper scenario of GRF, agents learn a “Tiki-Taka” style behavior: each player keeps passing the ball to their teammates.
Discussions and Takeaways
In this post, we provide a concrete analysis of VD and PG methods in cooperative MARL. First, we reveal the limitation on the expressiveness of popular VD methods, showing that they could not represent optimal policies even in a simple permutation game. By contrast, we show that PG methods are provably more expressive. We empirically verify the expressiveness advantage of PG on popular MARL testbeds, including SMAC, GRF, and Hanabi Challenge. We hope the insights from this work could benefit the community towards more general and more powerful cooperative MARL algorithms in the future.
This post is based on our paper: Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning (paper, website).