MAPPO

Category: Multi-Agent
Framework: PyTorch
Environment: PettingZoo (MPE, Butterfly)
Created: August 21, 2025

PyTorch implementation of the Multi-Agent Proximal Policy Optimization (MAPPO) reinforcement learning algorithm.

Demo: MAPPO agents cooperating in the Simple Spread environment (images/simple_spread.mp4)

Overview

Multi-Agent Proximal Policy Optimization (MAPPO) is a centralized training with decentralized execution (CTDE) algorithm that extends PPO to multi-agent settings. MAPPO uses a centralized critic during training while maintaining decentralized policies for execution, making it highly effective for cooperative multi-agent tasks.

Algorithm Theory

Core Concept

MAPPO operates under the Centralized Training, Decentralized Execution (CTDE) paradigm, where agents share information during training but act independently during execution. This approach allows agents to leverage global information for better coordination while maintaining the benefits of decentralized execution.

Key Components

1. Centralized Training

  • All agents share a centralized critic network
  • Global state information is available during training
  • Joint optimization of all agent policies

2. Decentralized Execution

  • Each agent has its own policy network
  • Agents act based on local observations only
  • No communication required during execution

3. Proximal Policy Optimization

  • Uses PPO’s clipped surrogate objective for stable updates
  • Clipping acts as an approximate trust region, preventing destructively large policy changes
  • Entropy regularization encourages exploration (see the loss sketch after this list)

4. Random Network Distillation (RND) Variants

  • Intrinsic motivation for exploration
  • Helps agents discover novel strategies
  • Improves performance in complex environments
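As a concrete illustration of the clipped objective referenced above, here is a minimal sketch of a per-agent PPO policy loss with an entropy bonus. The function and tensor names (logprobs, old_logprobs, advantages) are illustrative assumptions, not identifiers from this repo.

import torch

def ppo_policy_loss(logprobs, old_logprobs, advantages,
                    entropy, clip_coeff=0.2, entropy_coeff=0.02):
    # Ratio between the updated policy and the policy that collected the rollout
    ratio = torch.exp(logprobs - old_logprobs)
    # Clipped surrogate: take the pessimistic minimum of the two terms
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_coeff, 1.0 + clip_coeff) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Entropy bonus encourages exploration
    return policy_loss - entropy_coeff * entropy.mean()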

Implementation Details

Network Architecture

Centralized Critic

import torch.nn as nn

class CentralizedCritic(nn.Module):
    def __init__(self, global_state_dim, num_agents):
        super().__init__()
        # layer_init: orthogonal weight-initialization helper (sketched below)
        self.network = nn.Sequential(
            layer_init(nn.Linear(global_state_dim, 128)),
            nn.Tanh(),
            layer_init(nn.Linear(128, 128)),
            nn.Tanh(),
            layer_init(nn.Linear(128, 1), std=1.0),
        )

    def forward(self, global_state):
        # Single scalar value estimate for the joint (global) state
        return self.network(global_state)

Decentralized Actors

class Actor(nn.Module):
    def __init__(self, observation_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            layer_init(nn.Linear(observation_dim, 128)),
            nn.Tanh(),
            layer_init(nn.Linear(128, 128)),
            nn.Tanh(),
        )
        # Policy head maps features to one logit per discrete action
        self.actor = layer_init(nn.Linear(128, action_dim), std=0.01)

    def forward(self, observation):
        return self.actor(self.network(observation))
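Both networks use a layer_init helper for orthogonal initialization (see Optimization below). The helper itself is not reproduced in this README; the following is a sketch of the common CleanRL-style version and is an assumption about this repo's exact implementation.

import numpy as np
import torch.nn as nn

def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weights with a configurable gain and a constant (zero) bias,
    # a standard initialization scheme for PPO-family algorithms
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer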

Training Process

  1. Environment Interaction
    • Multiple parallel environments (15 by default)
    • Agents interact using decentralized policies
    • Global state information is collected for critic
  2. Experience Collection
    • Rollout length: 256 steps per environment (longer than IPPO)
    • Store local observations, actions, rewards, global states
    • Compute advantages with GAE using the centralized critic’s value estimates (see the sketch after this list)
  3. Policy Updates
    • PPO epochs: 10 (more than IPPO for better convergence)
    • Batch size per update: 3840 (15 envs × 256 steps)
    • Learning rate: 2.5e-4 with linear annealing
  4. Optimization
    • Adam optimizer with gradient clipping (0.5)
    • Orthogonal initialization for stable training
    • Entropy coefficient: 0.02 for enhanced exploration
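The sketch below shows how GAE advantages can be computed from the centralized critic's value estimates (step 2 above). Tensor shapes and the done-flag convention are illustrative assumptions rather than this repo's exact code.

import torch

def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    # rewards, values, dones: [num_steps, num_envs]; values come from the
    # centralized critic evaluated on global states.
    # dones[t] == 1 means the episode terminated after step t, so no bootstrap.
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(next_value)
    for t in reversed(range(num_steps)):
        next_values = next_value if t == num_steps - 1 else values[t + 1]
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_values * not_done - values[t]
        last_gae = delta + gamma * gae_lambda * not_done * last_gae
        advantages[t] = last_gae
    returns = advantages + values  # Regression targets for the centralized critic
    return advantages, returns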

Supported Environments

1. Simple Spread (Cooperative)

  • Environment: simple_spread_v3 (PettingZoo MPE; setup sketch below)
  • Task: Cooperative navigation in which agents must spread out to cover all landmarks
  • Actions: Discrete (5 actions per agent)
  • Observations: Vector observations with agent positions
  • Global State: Full environment state including all agent positions
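A typical way to build and vectorize this environment with PettingZoo and SuperSuit is sketched below; the specific parameters (N=3, max_cycles=25) and wrapper stack are assumptions for illustration, not necessarily this repo's exact setup.

import supersuit as ss
from pettingzoo.mpe import simple_spread_v3

# Parallel (simultaneous-action) API with discrete actions
env = simple_spread_v3.parallel_env(N=3, max_cycles=25, continuous_actions=False)

# Treat each agent as an index in a vector env, then run 15 copies in parallel
env = ss.pettingzoo_env_to_vec_env_v1(env)
envs = ss.concat_vec_envs_v1(env, 15, num_cpus=0, base_class="gymnasium")

obs, infos = envs.reset()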

2. Cooperative Pong (Butterfly)

  • Environment: cooperative_pong_v5
  • Task: Cooperative version of Pong where agents work together
  • Actions: Discrete actions for paddle movement
  • Observations: Image-based (preprocessing sketch below)
  • Global State: Full game state including ball and paddle positions
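Because the observations are images, a typical SuperSuit preprocessing stack (grayscale, resize, frame stack) is sketched below before the same vectorization as above; the exact sizes and wrappers are assumptions for illustration.

import supersuit as ss
from pettingzoo.butterfly import cooperative_pong_v5

env = cooperative_pong_v5.parallel_env()
env = ss.color_reduction_v0(env, mode="B")     # grayscale via the blue channel
env = ss.resize_v1(env, x_size=84, y_size=84)  # downscale frames to 84x84
env = ss.frame_stack_v1(env, 4)                # stack 4 frames to capture motion
# Followed by pettingzoo_env_to_vec_env_v1 / concat_vec_envs_v1 as in the
# Simple Spread sketch above.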

3. RND-Enhanced Environments

  • Purpose: Improved exploration through intrinsic motivation
  • Implementation: RND networks provide additional reward signals
  • Benefits: Better performance in complex, sparse-reward environments

Usage

Installation

pip install torch pettingzoo[mpe,butterfly] supersuit wandb tqdm imageio opencv-python gymnasium

Training Commands

Standard MAPPO (Simple Spread)

python mappo_without_rnd.py --env_id simple_spread_v3 --total_timesteps 20000000

MAPPO with RND

python mappo_rnd.py --env_id simple_spread_v3 --total_timesteps 20000000

MAPPO for Cooperative Pong

python mappo_rnd_pong.py --env_id cooperative_pong_v5 --total_timesteps 10000000

MAPPO Training Script

python train.py --env_id cooperative_pong_v5 --total_timesteps 10000000

Key Hyperparameters

# Training Configuration
lr = 2.5e-4                    # Learning rate
num_envs = 15                  # Parallel environments
max_steps = 256               # Rollout length (longer than IPPO)
PPO_EPOCHS = 10               # PPO update epochs (more than IPPO)
clip_coeff = 0.2              # PPO clipping coefficient
ENTROPY_COEFF = 0.02          # Entropy regularization (higher than IPPO)
GAE = 0.95                    # GAE lambda parameter
total_timesteps = 20000000    # Total training steps

Evaluation

# Evaluate trained model
python mappo_without_rnd.py --eval --checkpoint "checkpoint.pt"

# Interactive play
python play_ippo.py "checkpoint.pt"

Technical Implementation

File Structure

MAPPO/
├── mappo_without_rnd.py    # Standard MAPPO implementation
├── mappo_rnd.py           # MAPPO with RND for exploration
├── mappo_rnd_pong.py      # MAPPO with RND for cooperative Pong
├── train.py               # MAPPO training script
├── images/                # Training visualizations
│   └── simple_spread.mp4  # Demo video
└── README.md             # This file

Key Classes

Config

Centralized configuration class containing all hyperparameters and training settings.

CentralizedCritic

Global value function that has access to the full environment state.

Actor Networks

Decentralized policy networks for each agent.

MAPPO Trainer

Main training loop implementing the MAPPO algorithm with centralized training.

RND Integration

Random Network Distillation

RND provides intrinsic motivation by measuring how “surprising” or “novel” an observation is:

class RNDNetwork(nn.Module):
    def __init__(self, observation_dim):
        super().__init__()
        self.predictor = nn.Sequential(...)  # Trained to predict the target's features
        self.target = nn.Sequential(...)     # Randomly initialized and kept fixed
        for param in self.target.parameters():
            param.requires_grad = False      # Target network is never updated
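A minimal sketch of how the intrinsic reward is typically derived from these two networks (the squared prediction error on the frozen target's features); the reward normalization and scaling used in this repo are not shown above and are left out of the sketch.

import torch

def rnd_intrinsic_reward(rnd_net, obs):
    # Frequently visited states are easy to predict (small bonus); novel states
    # give large prediction errors (large bonus). The predictor is trained
    # separately by minimizing this same error on collected observations.
    with torch.no_grad():
        target_features = rnd_net.target(obs)
        predicted_features = rnd_net.predictor(obs)
        return ((predicted_features - target_features) ** 2).mean(dim=-1)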


Contributing

This implementation is part of a larger MARL research project. Contributions are welcome in the form of:

  • Bug reports and fixes
  • Performance improvements
  • New environment support
  • Algorithm extensions

License

This implementation is open source and available under the MIT License.

Source Code

📁 GitHub Repository: Mappo (MARL Mappo)

View the complete implementation, training scripts, and documentation on GitHub.