MAPPO
Implementation of the MAPPO reinforcement learning algorithm
Technical Details
- Framework: PyTorch
- Environment: PettingZoo (MPE, Butterfly)
- Category: Multi-Agent Reinforcement Learning
MAPPO agents cooperating in Simple Spread environment
Overview
Multi-Agent Proximal Policy Optimization (MAPPO) is a centralized training with decentralized execution (CTDE) algorithm that extends PPO to multi-agent settings. MAPPO uses a centralized critic during training while maintaining decentralized policies for execution, making it highly effective for cooperative multi-agent tasks.
Algorithm Theory
Core Concept
MAPPO operates under the Centralized Training, Decentralized Execution (CTDE) paradigm, where agents share information during training but act independently during execution. This approach allows agents to leverage global information for better coordination while maintaining the benefits of decentralized execution.
Key Components
1. Centralized Training
- All agents share a centralized critic network
- Global state information is available during training
- Joint optimization of all agent policies
2. Decentralized Execution
- Each agent has its own policy network
- Agents act based on local observations only
- No communication required during execution
3. Proximal Policy Optimization
- Uses PPO’s clipped objective function for stable updates (see the loss sketch after this list)
- Trust region optimization prevents large policy changes
- Entropy regularization encourages exploration
4. Random Network Distillation (RND) Variants
- Intrinsic motivation for exploration
- Helps agents discover novel strategies
- Improves performance in complex environments
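To make the PPO component concrete, here is a minimal sketch of the clipped surrogate loss with an entropy bonus used in the policy update. Variable names (new_logprobs, old_logprobs, advantages, entropy) are illustrative and do not necessarily match the tensors in this repository.

import torch

def ppo_policy_loss(new_logprobs, old_logprobs, advantages, entropy,
                    clip_coeff=0.2, entropy_coeff=0.02):
    # Probability ratio between the updated policy and the behavior policy
    ratio = torch.exp(new_logprobs - old_logprobs)
    # Clipped surrogate objective: take the pessimistic (minimum) bound
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_coeff, 1.0 + clip_coeff) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Entropy bonus encourages exploration
    return policy_loss - entropy_coeff * entropy.mean()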
Implementation Details
Network Architecture
Centralized Critic
class CentralizedCritic(nn.Module):
    def __init__(self, global_state_dim, num_agents):
        super().__init__()
        # layer_init applies orthogonal initialization (see the helper sketch below)
        self.network = nn.Sequential(
            layer_init(nn.Linear(global_state_dim, 128)),
            nn.Tanh(),
            layer_init(nn.Linear(128, 128)),
            nn.Tanh(),
            layer_init(nn.Linear(128, 1), std=1.0),  # single value for the joint/global state
        )

    def forward(self, global_state):
        return self.network(global_state)
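Both networks call a layer_init helper whose definition is not shown on this page. The version below is an assumed orthogonal-initialization helper in the common CleanRL style, consistent with the std arguments used above.

import numpy as np
import torch

def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weight initialization with a configurable gain and zero bias
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer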
Decentralized Actors
class Actor(nn.Module):
    def __init__(self, observation_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            layer_init(nn.Linear(observation_dim, 128)),
            nn.Tanh(),
            layer_init(nn.Linear(128, 128)),
            nn.Tanh(),
        )
        # Logits head over the agent's discrete actions
        self.actor = layer_init(nn.Linear(128, action_dim), std=0.01)
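At execution time each agent turns the logits head above into a discrete action using only its local observation. A minimal sketch (the helper name select_action is illustrative, not a function from the repository):

from torch.distributions import Categorical

def select_action(actor, local_obs):
    # Decentralized execution: the agent sees only its own observation
    hidden = actor.network(local_obs)
    logits = actor.actor(hidden)
    dist = Categorical(logits=logits)
    action = dist.sample()
    return action, dist.log_prob(action), dist.entropy()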
Training Process
1. Environment Interaction
- Multiple parallel environments (15 by default)
- Agents act using their decentralized policies
- Global state information is collected for the centralized critic
2. Experience Collection
- Rollout length: 256 steps per environment (longer than in IPPO)
- Store local observations, actions, rewards, and global states
- Compute advantages with the centralized critic (see the GAE sketch after this list)
3. Policy Updates
- PPO epochs: 10 (more than in IPPO, for better convergence)
- Minibatch size: 3840 (15 envs × 256 steps)
- Learning rate: 2.5e-4 with linear annealing
4. Optimization
- Adam optimizer with gradient clipping (0.5)
- Orthogonal initialization for stable training
- Entropy coefficient: 0.02 for enhanced exploration
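The advantage computation in step 2 uses Generalized Advantage Estimation (GAE lambda = 0.95) on values from the centralized critic. The sketch below assumes a discount factor gamma = 0.99 (not listed above) and per-step tensors of shape [num_steps, num_envs]; names are illustrative.

import torch

def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    # dones[t] = 1.0 if the episode terminated at step t
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(num_steps)):
        next_values = next_value if t == num_steps - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        # TD error, masking the bootstrap value at episode boundaries
        delta = rewards[t] + gamma * next_values * next_nonterminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_nonterminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values  # targets for the centralized critic
    return advantages, returns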
Supported Environments
1. Simple Spread (Cooperative)
- Environment: simple_spread_v3
- Task: Cooperative navigation where agents must cover landmarks
- Actions: Discrete (5 actions per agent)
- Observations: Vector observations with agent positions
- Global State: Full environment state including all agent positions
2. Cooperative Pong (Butterfly)
- Environment: cooperative_pong_v5
- Task: Cooperative version of Pong where agents work together
- Actions: Discrete actions for paddle movement
- Observations: Image-based observations
- Global State: Full game state including ball and paddle positions
3. RND-Enhanced Environments
- Purpose: Improved exploration through intrinsic motivation
- Implementation: RND networks provide additional reward signals
- Benefits: Better performance in complex, sparse-reward environments
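As a usage example, the Simple Spread environment can be created with PettingZoo and vectorized with SuperSuit roughly as follows. The parameter values are illustrative defaults, and reset/step signatures can vary slightly between SuperSuit releases.

import supersuit as ss
from pettingzoo.mpe import simple_spread_v3

# Parallel PettingZoo environment: 3 agents must cover 3 landmarks
env = simple_spread_v3.parallel_env(N=3, local_ratio=0.5, max_cycles=25,
                                    continuous_actions=False)
# Flatten the multi-agent env into a vector env, then stack 15 copies
env = ss.pettingzoo_env_to_vec_env_v1(env)
envs = ss.concat_vec_envs_v1(env, num_vec_envs=15, num_cpus=0, base_class="gymnasium")
obs, infos = envs.reset(seed=1)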
Usage
Installation
pip install torch pettingzoo[mpe,butterfly] supersuit wandb tqdm imageio opencv-python gymnasium
Training Commands
Standard MAPPO (Simple Spread)
python mappo_without_rnd.py --env_id simple_spread_v3 --total_timesteps 20000000
MAPPO with RND
python mappo_rnd.py --env_id simple_spread_v3 --total_timesteps 20000000
MAPPO for Cooperative Pong
python mappo_rnd_pong.py --env_id cooperative_pong_v5 --total_timesteps 10000000
MAPPO Training Script
python train.py --env_id cooperative_pong_v5 --total_timesteps 10000000
Key Hyperparameters
# Training Configuration
lr = 2.5e-4 # Learning rate
num_envs = 15 # Parallel environments
max_steps = 256 # Rollout length (longer than IPPO)
PPO_EPOCHS = 10 # PPO update epochs (more than IPPO)
clip_coeff = 0.2 # PPO clipping coefficient
ENTROPY_COEFF = 0.02 # Entropy regularization (higher than IPPO)
GAE = 0.95 # GAE lambda parameter
total_timesteps = 20000000 # Total training steps
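The linear learning-rate annealing mentioned above is typically applied once per update; the loop skeleton below assumes an Adam optimizer named optimizer and reuses the hyperparameters listed above.

# Number of policy updates over the whole run
num_updates = total_timesteps // (num_envs * max_steps)
for update in range(1, num_updates + 1):
    # Linearly anneal the learning rate from lr down to 0
    frac = 1.0 - (update - 1.0) / num_updates
    optimizer.param_groups[0]["lr"] = frac * lr
    # ... collect a 256-step rollout, then run PPO_EPOCHS epochs of minibatch updates ...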
Evaluation
# Evaluate trained model
python mappo_without_rnd.py --eval --checkpoint "checkpoint.pt"
# Interactive play
python play_ippo.py "checkpoint.pt"
Technical Implementation
File Structure
MAPPO/
├── mappo_without_rnd.py # Standard MAPPO implementation
├── mappo_rnd.py # MAPPO with RND for exploration
├── mappo_rnd_pong.py # MAPPO with RND for cooperative Pong
├── train.py # MAPPO training script
├── images/ # Training visualizations
│ └── simple_spread.mp4 # Demo video
└── README.md # This file
Key Classes
Config
Centralized configuration class containing all hyperparameters and training settings.
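The exact fields live in the training scripts; a representative sketch based on the hyperparameters listed above (field names are illustrative):

from dataclasses import dataclass

@dataclass
class Config:
    env_id: str = "simple_spread_v3"
    total_timesteps: int = 20_000_000
    lr: float = 2.5e-4
    num_envs: int = 15
    max_steps: int = 256        # rollout length
    ppo_epochs: int = 10
    clip_coeff: float = 0.2
    entropy_coeff: float = 0.02
    gae_lambda: float = 0.95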
CentralizedCritic
Global value function that has access to the full environment state.
Actor Networks
Decentralized policy networks for each agent.
MAPPO Trainer
Main training loop implementing the MAPPO algorithm with centralized training.
RND Integration
Random Network Distillation
RND provides intrinsic motivation by measuring how “surprising” or “novel” an observation is:
class RNDNetwork(nn.Module):
    def __init__(self, observation_dim):
        super().__init__()
        self.predictor = nn.Sequential(...)  # Trained to predict the target's features
        self.target = nn.Sequential(...)     # Randomly initialized and kept frozen
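The intrinsic reward is the prediction error between the two networks: observations the predictor has not yet learned to reproduce yield a large error and therefore a large exploration bonus. A minimal sketch, assuming both networks map observations to feature vectors (the helper name intrinsic_reward is illustrative):

import torch

def intrinsic_reward(rnd, obs):
    # The target network is frozen; only the predictor is trained (on this same error)
    with torch.no_grad():
        target_features = rnd.target(obs)
    predicted_features = rnd.predictor(obs)
    # Per-observation mean squared prediction error serves as the novelty bonus
    return ((predicted_features - target_features) ** 2).mean(dim=-1)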
References
Papers
- The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games
- Proximal Policy Optimization Algorithms
- Exploration by Random Network Distillation
- Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Contributing
This implementation is part of a larger MARL research project. Contributions are welcome in the form of:
- Bug reports and fixes
- Performance improvements
- New environment support
- Algorithm extensions
License
This implementation is open source and available under the MIT License.
Source Code
📁 GitHub Repository: Mappo (MARL Mappo)
View the complete implementation, training scripts, and documentation on GitHub.