MAPPO

Category: Multi-Agent
Framework: PyTorch
Environment: PettingZoo (MPE, Butterfly)
Created: August 21, 2025

PyTorch implementation of the Multi-Agent Proximal Policy Optimization (MAPPO) reinforcement learning algorithm.

Demo: MAPPO agents cooperating in the Simple Spread environment (images/simple_spread.mp4)

Overview

Multi-Agent Proximal Policy Optimization (MAPPO) is a centralized training with decentralized execution (CTDE) algorithm that extends PPO to multi-agent settings. MAPPO uses a centralized critic during training while maintaining decentralized policies for execution, making it highly effective for cooperative multi-agent tasks.

Algorithm Theory

Core Concept

MAPPO operates under the Centralized Training, Decentralized Execution (CTDE) paradigm, where agents share information during training but act independently during execution. This approach allows agents to leverage global information for better coordination while maintaining the benefits of decentralized execution.

Key Components

1. Centralized Training

  • All agents share a centralized critic network
  • Global state information is available during training
  • Joint optimization of all agent policies

2. Decentralized Execution

  • Each agent has its own policy network
  • Agents act based on local observations only
  • No communication required during execution

3. Proximal Policy Optimization

  • Uses PPO’s clipped surrogate objective for stable updates
  • Clipping acts as an approximate trust region, preventing destructively large policy changes
  • Entropy regularization encourages exploration (see the loss sketch after this list)

4. Random Network Distillation (RND) Variants

  • Intrinsic motivation for exploration
  • Helps agents discover novel strategies
  • Improves performance in complex environments
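As a concrete illustration of the clipped objective referenced above, here is a minimal sketch of a per-agent PPO policy loss with an entropy bonus. The function and tensor names (logprobs, old_logprobs, advantages) are illustrative assumptions, not identifiers from this repo.

import torch

def ppo_policy_loss(logprobs, old_logprobs, advantages,
                    entropy, clip_coeff=0.2, entropy_coeff=0.02):
    # Ratio between the updated policy and the policy that collected the rollout
    ratio = torch.exp(logprobs - old_logprobs)
    # Clipped surrogate: take the pessimistic minimum of the two terms
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_coeff, 1.0 + clip_coeff) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Entropy bonus encourages exploration
    return policy_loss - entropy_coeff * entropy.mean()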

Implementation Details

Network Architecture

Centralized Critic

import torch.nn as nn

class CentralizedCritic(nn.Module):
    def __init__(self, global_state_dim, num_agents):
        super().__init__()
        # layer_init: orthogonal weight-initialization helper (sketched below)
        self.network = nn.Sequential(
            layer_init(nn.Linear(global_state_dim, 128)),
            nn.Tanh(),
            layer_init(nn.Linear(128, 128)),
            nn.Tanh(),
            layer_init(nn.Linear(128, 1), std=1.0),
        )

    def forward(self, global_state):
        # Single scalar value estimate for the joint (global) state
        return self.network(global_state)

Decentralized Actors

class Actor(nn.Module):
    def __init__(self, observation_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            layer_init(nn.Linear(observation_dim, 128)),
            nn.Tanh(),
            layer_init(nn.Linear(128, 128)),
            nn.Tanh(),
        )
        # Policy head maps features to one logit per discrete action
        self.actor = layer_init(nn.Linear(128, action_dim), std=0.01)

    def forward(self, observation):
        return self.actor(self.network(observation))
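Both networks use a layer_init helper for orthogonal initialization (see Optimization below). The helper itself is not reproduced in this README; the following is a sketch of the common CleanRL-style version and is an assumption about this repo's exact implementation.

import numpy as np
import torch.nn as nn

def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weights with a configurable gain and a constant (zero) bias,
    # a standard initialization scheme for PPO-family algorithms
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer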

Training Process

  1. Environment Interaction
    • Multiple parallel environments (15 by default)
    • Agents interact using decentralized policies
    • Global state information is collected for critic
  2. Experience Collection
    • Rollout length: 256 steps per environment (longer than IPPO)
    • Store local observations, actions, rewards, global states
    • Compute advantages with GAE using the centralized critic’s value estimates (see the sketch after this list)
  3. Policy Updates
    • PPO epochs: 10 (more than IPPO for better convergence)
    • Batch size per update: 3840 (15 envs × 256 steps)
    • Learning rate: 2.5e-4 with linear annealing
  4. Optimization
    • Adam optimizer with gradient clipping (0.5)
    • Orthogonal initialization for stable training
    • Entropy coefficient: 0.02 for enhanced exploration
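The sketch below shows how GAE advantages can be computed from the centralized critic's value estimates (step 2 above). Tensor shapes and the done-flag convention are illustrative assumptions rather than this repo's exact code.

import torch

def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    # rewards, values, dones: [num_steps, num_envs]; values come from the
    # centralized critic evaluated on global states.
    # dones[t] == 1 means the episode terminated after step t, so no bootstrap.
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(next_value)
    for t in reversed(range(num_steps)):
        next_values = next_value if t == num_steps - 1 else values[t + 1]
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_values * not_done - values[t]
        last_gae = delta + gamma * gae_lambda * not_done * last_gae
        advantages[t] = last_gae
    returns = advantages + values  # Regression targets for the centralized critic
    return advantages, returns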

Supported Environments

1. Simple Spread (Cooperative)

  • Environment: simple_spread_v3 (PettingZoo MPE; setup sketch below)
  • Task: Cooperative navigation in which agents must spread out to cover all landmarks
  • Actions: Discrete (5 actions per agent)
  • Observations: Vector observations with agent positions
  • Global State: Full environment state including all agent positions
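A typical way to build and vectorize this environment with PettingZoo and SuperSuit is sketched below; the specific parameters (N=3, max_cycles=25) and wrapper stack are assumptions for illustration, not necessarily this repo's exact setup.

import supersuit as ss
from pettingzoo.mpe import simple_spread_v3

# Parallel (simultaneous-action) API with discrete actions
env = simple_spread_v3.parallel_env(N=3, max_cycles=25, continuous_actions=False)

# Treat each agent as an index in a vector env, then run 15 copies in parallel
env = ss.pettingzoo_env_to_vec_env_v1(env)
envs = ss.concat_vec_envs_v1(env, 15, num_cpus=0, base_class="gymnasium")

obs, infos = envs.reset()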

2. Cooperative Pong (Butterfly)

  • Environment: cooperative_pong_v5
  • Task: Cooperative version of Pong where agents work together
  • Actions: Discrete actions for paddle movement
  • Observations: Image-based (preprocessing sketch below)
  • Global State: Full game state including ball and paddle positions
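Because the observations are images, a typical SuperSuit preprocessing stack (grayscale, resize, frame stack) is sketched below before the same vectorization as above; the exact sizes and wrappers are assumptions for illustration.

import supersuit as ss
from pettingzoo.butterfly import cooperative_pong_v5

env = cooperative_pong_v5.parallel_env()
env = ss.color_reduction_v0(env, mode="B")     # grayscale via the blue channel
env = ss.resize_v1(env, x_size=84, y_size=84)  # downscale frames to 84x84
env = ss.frame_stack_v1(env, 4)                # stack 4 frames to capture motion
# Followed by pettingzoo_env_to_vec_env_v1 / concat_vec_envs_v1 as in the
# Simple Spread sketch above.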

3. RND-Enhanced Environments

  • Purpose: Improved exploration through intrinsic motivation
  • Implementation: RND networks provide additional reward signals
  • Benefits: Better performance in complex, sparse-reward environments

Usage

Installation

pip install torch pettingzoo[mpe,butterfly] supersuit wandb tqdm imageio opencv-python gymnasium

Training Commands

Standard MAPPO (Simple Spread)

python mappo_without_rnd.py --env_id simple_spread_v3 --total_timesteps 20000000

MAPPO with RND

python mappo_rnd.py --env_id simple_spread_v3 --total_timesteps 20000000

MAPPO for Cooperative Pong

python mappo_rnd_pong.py --env_id cooperative_pong_v5 --total_timesteps 10000000

MAPPO Training Script

python train.py --env_id cooperative_pong_v5 --total_timesteps 10000000

Key Hyperparameters

# Training Configuration
lr = 2.5e-4                    # Learning rate
num_envs = 15                  # Parallel environments
max_steps = 256               # Rollout length (longer than IPPO)
PPO_EPOCHS = 10               # PPO update epochs (more than IPPO)
clip_coeff = 0.2              # PPO clipping coefficient
ENTROPY_COEFF = 0.02          # Entropy regularization (higher than IPPO)
GAE = 0.95                    # GAE lambda parameter
total_timesteps = 20000000    # Total training steps

Evaluation

# Evaluate trained model
python mappo_without_rnd.py --eval --checkpoint "checkpoint.pt"

# Interactive play
python play_ippo.py "checkpoint.pt"

Technical Implementation

File Structure

MAPPO/
├── mappo_without_rnd.py    # Standard MAPPO implementation
├── mappo_rnd.py           # MAPPO with RND for exploration
├── mappo_rnd_pong.py      # MAPPO with RND for cooperative Pong
├── train.py               # MAPPO training script
├── images/                # Training visualizations
│   └── simple_spread.mp4  # Demo video
└── README.md             # This file

Key Classes

Config

Centralized configuration class containing all hyperparameters and training settings.

CentralizedCritic

Global value function that has access to the full environment state.

Actor Networks

Decentralized policy networks for each agent.

MAPPO Trainer

Main training loop implementing the MAPPO algorithm with centralized training.

RND Integration

Random Network Distillation

RND provides intrinsic motivation by measuring how “surprising” or “novel” an observation is:

class RNDNetwork(nn.Module):
    def __init__(self, observation_dim):
        super().__init__()
        self.predictor = nn.Sequential(...)  # Trained to predict the target's features
        self.target = nn.Sequential(...)     # Randomly initialized and kept fixed
        for param in self.target.parameters():
            param.requires_grad = False      # Target network is never updated
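A minimal sketch of how the intrinsic reward is typically derived from these two networks (the squared prediction error on the frozen target's features); the reward normalization and scaling used in this repo are not shown above and are left out of the sketch.

import torch

def rnd_intrinsic_reward(rnd_net, obs):
    # Frequently visited states are easy to predict (small bonus); novel states
    # give large prediction errors (large bonus). The predictor is trained
    # separately by minimizing this same error on collected observations.
    with torch.no_grad():
        target_features = rnd_net.target(obs)
        predicted_features = rnd_net.predictor(obs)
        return ((predicted_features - target_features) ** 2).mean(dim=-1)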


Contributing

This implementation is part of a larger MARL research project. Contributions are welcome in the form of:

  • Bug reports and fixes
  • Performance improvements
  • New environment support
  • Algorithm extensions

License

This implementation is open source and available under the MIT License.

Source Code

📁 GitHub Repository: Mappo (MARL Mappo)

View the complete implementation, training scripts, and documentation on GitHub.