IPPO

Category: Multi-Agent
Framework: PyTorch
Environment: Atari
Created: August 21, 2025

PyTorch implementation of the Independent Proximal Policy Optimization (IPPO) multi-agent reinforcement learning algorithm.

Technical Details

  • Framework: PyTorch
  • Environment: Atari
  • Category: Multi-Agent

IPPO Pong Demo
IPPO agents competing in the Pong environment

Overview

Independent Proximal Policy Optimization (IPPO) is a multi-agent reinforcement learning algorithm that extends single-agent PPO to multi-agent settings. Unlike centralized-training approaches, each agent learns its own policy independently; in this implementation the agents additionally share a common observation-encoding architecture to reduce overhead.

Algorithm Theory

Core Concept

IPPO operates on the principle that each agent can learn an optimal policy independently while sharing a common observation encoder. This approach is particularly effective in cooperative multi-agent environments where agents need to coordinate but can benefit from independent learning.

Key Components

1. Independent Learning

  • Each agent maintains its own policy network (actor) and value network (critic), as sketched below
  • Agents learn independently without direct policy sharing
  • Shared observation processing reduces computational overhead
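
The per-agent structure can be organized as independent modules, for example as dictionaries keyed by agent name. The sketch below is illustrative and assumes the Actor and Critic classes shown under Implementation Details; the dimensions and agent names are placeholders, not values taken from the repository.

import torch.optim as optim

# Placeholder dimensions; in practice these come from the environment's spaces.
observation_dim, action_dim = 18, 5
agent_names = ["agent_0", "agent_1", "agent_2"]

# One actor and one critic per agent: policies are learned independently.
actors = {name: Actor(observation_dim, action_dim) for name in agent_names}
critics = {name: Critic(observation_dim) for name in agent_names}

# Independent optimizers as well; only the observation-encoding architecture is shared.
optimizers = {
    name: optim.Adam(
        list(actors[name].parameters()) + list(critics[name].parameters()),
        lr=2.5e-4,
    )
    for name in agent_names
}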

2. Proximal Policy Optimization

  • Uses PPO’s clipped objective function to ensure stable policy updates (see the loss sketch after this list)
  • The clipping acts as an approximate trust region, preventing destructively large policy changes
  • Entropy regularization encourages exploration
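
A minimal sketch of the clipped surrogate loss in PyTorch; the function and variable names are illustrative, not the repository's, and advantage normalization is an assumed but common detail.

import torch

def ppo_policy_loss(new_logprobs, old_logprobs, advantages, clip_coeff=0.2):
    # Probability ratio between the updated policy and the policy that collected the data
    ratio = (new_logprobs - old_logprobs).exp()
    # Normalize advantages to keep the gradient scale stable (assumed detail)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    # Clipped surrogate objective: take the pessimistic minimum of the two terms
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_coeff, 1.0 + clip_coeff) * advantages
    return -torch.min(unclipped, clipped).mean()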

3. Generalized Advantage Estimation (GAE)

  • Computes advantages using GAE with λ = 0.95 (sketched after this list)
  • Reduces variance in policy gradient estimates
  • Balances bias-variance trade-off in advantage estimation
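
A sketch of the GAE recursion with λ = 0.95; the discount factor γ = 0.99 is an assumption (the README does not state it), and the tensor layout is illustrative.

import torch

def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    # rewards, values, dones: tensors of shape [T, num_envs]; next_value bootstraps the last step
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(next_value)
    for t in reversed(range(T)):
        next_v = next_value if t == T - 1 else values[t + 1]
        not_done = 1.0 - dones[t]
        # One-step TD error
        delta = rewards[t] + gamma * next_v * not_done - values[t]
        # Exponentially weighted sum of TD errors (the GAE recursion)
        last_gae = delta + gamma * gae_lambda * not_done * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    return advantages, returns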

Implementation Details

Network Architecture

Shared Observation Encoder

import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, observation_dim):
        super().__init__()
        # Two-layer MLP with tanh activations; layer_init applies orthogonal initialization
        self.network = nn.Sequential(
            layer_init(nn.Linear(observation_dim, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
        )

    def forward(self, x):
        return self.network(x)

Agent-Specific Heads

class Actor(nn.Module):
    def __init__(self, observation_dim, action_dim):
        super().__init__()
        # Shared feature extraction (same two-layer MLP as the encoder above)
        self.network = nn.Sequential(...)
        # Agent-specific actor head; small std keeps the initial policy close to uniform
        self.actor = layer_init(nn.Linear(64, action_dim), std=0.01)


class Critic(nn.Module):
    def __init__(self, observation_dim):
        super().__init__()
        # Shared feature extraction
        self.network = nn.Sequential(...)
        # Agent-specific critic head
        self.critic = layer_init(nn.Linear(64, 1), std=1.0)
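
The layer_init helper is not shown in this README; a common definition, consistent with the orthogonal initialization mentioned under Optimization below, would be the following (an assumed sketch, not the repository's exact code).

import torch.nn as nn

def layer_init(layer, std=2 ** 0.5, bias_const=0.0):
    # Orthogonal weight initialization with gain std, constant bias
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer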

Training Process

  1. Environment Interaction
    • Multiple parallel environments (15 by default)
    • Each agent interacts independently
    • Observations are processed through shared encoder
  2. Experience Collection
    • Rollout length: 128 steps per environment
    • Store observations, actions, rewards, values, log probabilities
    • Compute advantages using GAE
  3. Policy Updates
    • PPO epochs: 4
    • Batch size: 1920 (15 envs × 128 steps)
    • Learning rate: 2.5e-4 with linear annealing
  4. Optimization
    • Adam optimizer with gradient clipping (0.5)
    • Orthogonal initialization for stable training
    • Entropy coefficient: 0.001 for exploration (the full update step is sketched after this list)
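
A sketch of the optimization step combining the ingredients above. The learning rate, linear annealing, gradient-clipping threshold, and entropy coefficient come from this list; the 0.5 value-loss weight and the agent_parameters name are assumptions.

import torch
import torch.optim as optim

LR, ENTROPY_COEFF, MAX_GRAD_NORM = 2.5e-4, 0.001, 0.5

optimizer = optim.Adam(agent_parameters, lr=LR)

def update_step(update_idx, num_updates, policy_loss, value_loss, entropy):
    # Linear learning-rate annealing over the course of training
    frac = 1.0 - update_idx / num_updates
    optimizer.param_groups[0]["lr"] = frac * LR

    # Combined PPO loss; the 0.5 value-loss weight is an assumed, commonly used value
    loss = policy_loss + 0.5 * value_loss - ENTROPY_COEFF * entropy

    optimizer.zero_grad()
    loss.backward()
    # Global gradient-norm clipping at 0.5 keeps updates stable
    torch.nn.utils.clip_grad_norm_(agent_parameters, MAX_GRAD_NORM)
    optimizer.step()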

Supported Environments

1. Simple Spread (Discrete Actions)

  • Environment: simple_spread_v3
  • Task: Cooperative navigation where agents must cover landmarks
  • Actions: Discrete (5 actions per agent)
  • Observations: Vector observations with agent positions and landmark locations
  • Reward: Cooperative reward based on landmark coverage

2. Simple Tag (Competitive)

  • Environment: simple_tag_v3
  • Task: Competitive tagging game
  • Actions: Discrete actions for movement and tagging
  • Observations: Vector observations with agent positions
  • Reward: Competitive rewards for taggers and runners

3. Continuous Control

  • Environment: simple_spread_v3 (continuous variant)
  • Task: Same cooperative navigation with continuous actions
  • Actions: Continuous 2D movement vectors
  • Observations: Same vector observations
  • Reward: Same cooperative reward structure (environment construction is sketched after this list)
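
A minimal sketch of constructing and vectorizing the Simple Spread environment, assuming the PettingZoo parallel API and SuperSuit vectorization; the exact wrapper chain in the repository may differ.

import supersuit as ss
from pettingzoo.mpe import simple_spread_v3

# Discrete variant; pass continuous_actions=True for the continuous-control variant
env = simple_spread_v3.parallel_env(N=3, max_cycles=25, continuous_actions=False)

# Stack 15 copies of the multi-agent environment for parallel rollouts
env = ss.pettingzoo_env_to_vec_env_v1(env)
env = ss.concat_vec_envs_v1(env, 15, num_cpus=0, base_class="gymnasium")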

Usage

Installation

pip install torch "pettingzoo[mpe]" supersuit wandb tqdm imageio opencv-python gymnasium

Training Commands

Discrete Actions (Simple Spread)

python ippo_discrete.py --env_id simple_spread_v3 --total_timesteps 20000000

Continuous Actions

python ippo_continuous.py --env_id simple_spread_v3 --total_timesteps 20000000

Simple Tag Environment

python ippo_simple_tag.py --env_id simple_tag_v3 --total_timesteps 20000000

Key Hyperparameters

# Training Configuration
lr = 2.5e-4                    # Learning rate
num_envs = 15                  # Parallel environments
max_steps = 128               # Rollout length
PPO_EPOCHS = 4                # PPO update epochs
clip_coeff = 0.2              # PPO clipping coefficient
ENTROPY_COEFF = 0.001         # Entropy regularization
GAE = 0.95                    # GAE lambda parameter
total_timesteps = 20000000    # Total training steps

Evaluation

# Evaluate trained model
python ippo_discrete.py --eval --checkpoint "checkpoint.pt"

# Interactive play (Pong)
python play_ippo.py "checkpoint.pt"

Technical Implementation

File Structure

IPPO/
├── ippo_discrete.py      # Discrete action implementation
├── ippo_continuous.py    # Continuous action implementation
├── ippo_simple_tag.py    # Simple Tag environment
├── play_ippo.py         # Interactive play script
├── images/              # Training visualizations
│   ├── pong.gif        # Demo video
│   └── image.png       # Training plots
└── README.md           # This file

Key Classes

Config

Centralized configuration class containing all hyperparameters and training settings.
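
A sketch of what such a configuration class might look like, populated with the hyperparameters listed above; the dataclass form and field names are assumptions.

from dataclasses import dataclass

@dataclass
class Config:
    env_id: str = "simple_spread_v3"
    total_timesteps: int = 20_000_000
    lr: float = 2.5e-4
    num_envs: int = 15
    max_steps: int = 128          # rollout length per environment
    ppo_epochs: int = 4
    clip_coeff: float = 0.2
    entropy_coeff: float = 0.001
    gae_lambda: float = 0.95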

Actor/Critic Networks

Agent-specific policy and value networks with shared observation processing.

IPPO Trainer

Main training loop implementing the IPPO algorithm with experience collection and policy updates.

Contributing

This implementation is part of a larger MARL research project. Contributions are welcome in the form of:

  • Bug reports and fixes
  • Performance improvements
  • New environment support
  • Algorithm extensions

License

This implementation is open source and available under the MIT License.

Source Code

📁 GitHub Repository: Ippo (MARL Ippo)

View the complete implementation, training scripts, and documentation on GitHub.