Atari
Implementation of a Proximal Policy Optimization (PPO) reinforcement learning agent for classic Atari games
Technical Details
- Framework: PyTorch
- Environment: Atari
- Category: Other
This directory contains Proximal Policy Optimization (PPO) implementations for training agents on classic Atari games using PyTorch and Gymnasium.
🎮 Overview
This implementation features vectorized PPO training on Atari environments with proper preprocessing, frame stacking, and environment wrappers. The code includes both custom implementations and Stable Baselines3 benchmarks for comparison.
🏓 Trained Agents in Action
Pong
PPO agent playing Pong after 10M training steps - achieving consistent wins against the built-in AI opponent
Bowling
PPO agent playing Bowling after 10M training steps - demonstrating learned bowling strategies and consistent scoring
Boxing
PPO agent playing Boxing after 10M training steps - showing aggressive fighting strategies and effective combat techniques
🎯 Supported Games
- Pong (`PongNoFrameskip-v4`) - Classic paddle game
- Boxing (`BoxingNoFrameskip-v4`) - Fighting game
- Bowling (`BowlingNoFrameskip-v4`) - Bowling simulation
📁 Files
- `atari-pong.py` - PPO implementation for Pong
- `boxing-atari.py` - PPO implementation for Boxing
- `atari-bowling.py` - PPO implementation for Bowling
- `sb3-atari-benchmark.py` - Stable Baselines3 benchmark comparison
- `images/` - Contains training videos and results
🚀 Features
Core PPO Implementation
- Vectorized Training: 8 parallel environments for efficient data collection
- Generalized Advantage Estimation (GAE): λ = 0.95 for bias-variance tradeoff
- Clipped Surrogate Objective: Prevents destructive policy updates
- Value Function Clipping: Stabilizes critic training
- Gradient Clipping: Prevents exploding gradients (max_grad_norm = 0.5)
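To make the clipping mechanics concrete, here is a minimal sketch of the clipped surrogate objective and clipped value loss using the hyperparameters above; the function name and signature are illustrative, not the repo's exact code.

```python
import torch

def ppo_losses(new_logprob, old_logprob, advantages, new_value, old_value, returns,
               clip_range=0.1, vf_clip=0.1):
    # Probability ratio between the updated policy and the data-collecting policy
    ratio = (new_logprob - old_logprob).exp()
    # Clipped surrogate objective: take the pessimistic of the clipped/unclipped terms
    pg_loss = -torch.min(
        ratio * advantages,
        ratio.clamp(1 - clip_range, 1 + clip_range) * advantages,
    ).mean()
    # Value function clipping: limit how far the new value estimate can move
    v_clipped = old_value + (new_value - old_value).clamp(-vf_clip, vf_clip)
    v_loss = 0.5 * torch.max((new_value - returns) ** 2,
                             (v_clipped - returns) ** 2).mean()
    return pg_loss, v_loss

# Gradient clipping before each optimizer step (max_grad_norm = 0.5):
# torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
```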
Atari-Specific Features
- Frame Preprocessing: Grayscale conversion and resizing to 64x64
- Frame Stacking: 4 consecutive frames as input
- Standard Atari Wrappers (see the environment-factory sketch below):
  - `NoopResetEnv`: Random no-op actions at episode start
  - `MaxAndSkipEnv`: Frame skipping and max pooling
  - `EpisodicLifeEnv`: Treats life loss as episode end
  - `FireResetEnv`: Automatically fires at episode start
  - `ClipRewardEnv`: Clips rewards to [-1, 1]
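A minimal sketch of an environment factory combining these wrappers, assuming Gymnasium 0.29-style wrapper names (`ResizeObservation`, `GrayScaleObservation`, `FrameStack` were renamed in later releases) and SB3's `atari_wrappers` module; the repo's actual factory may differ.

```python
import gymnasium as gym
import ale_py  # registers ALE/NoFrameskip envs (newer versions: gym.register_envs(ale_py))
from stable_baselines3.common.atari_wrappers import (
    ClipRewardEnv, EpisodicLifeEnv, FireResetEnv, MaxAndSkipEnv, NoopResetEnv,
)

def make_env(env_id: str = "PongNoFrameskip-v4"):
    def thunk():
        env = gym.make(env_id)
        env = NoopResetEnv(env, noop_max=30)          # random no-ops at reset
        env = MaxAndSkipEnv(env, skip=4)              # skip 4 frames, max-pool the last two
        env = EpisodicLifeEnv(env)                    # treat life loss as episode end
        if "FIRE" in env.unwrapped.get_action_meanings():
            env = FireResetEnv(env)                   # press FIRE to start the game
        env = ClipRewardEnv(env)                      # clip rewards to [-1, 1]
        env = gym.wrappers.ResizeObservation(env, (64, 64))
        env = gym.wrappers.GrayScaleObservation(env)
        env = gym.wrappers.FrameStack(env, 4)         # observation shape (4, 64, 64)
        return env
    return thunk

# 8 parallel environments, as in the hyperparameter table
envs = gym.vector.SyncVectorEnv([make_env() for _ in range(8)])
```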
Network Architecture
CNN Feature Extractor:
- Conv2d(4, 32, kernel=8, stride=4) + ReLU
- Conv2d(32, 64, kernel=4, stride=2) + ReLU
- Conv2d(64, 64, kernel=3, stride=1) + ReLU
- Flatten
- Linear(64*7*7, 512) + ReLU
Actor Head: Linear(512, action_space)
Critic Head: Linear(512, 1)
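A sketch of this actor-critic network in PyTorch; the flattened feature size is computed with a dummy forward pass because it depends on the input resolution (7x7 per channel for 84x84 frames, 4x4 for 64x64 frames), so the sketch works for either.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, n_actions: int, in_frames: int = 4, frame_size: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened size from a dummy frame stack
            n_flat = self.features(torch.zeros(1, in_frames, frame_size, frame_size)).shape[1]
        self.hidden = nn.Sequential(nn.Linear(n_flat, 512), nn.ReLU())
        self.actor = nn.Linear(512, n_actions)  # policy logits
        self.critic = nn.Linear(512, 1)         # state-value estimate

    def forward(self, obs):
        h = self.hidden(self.features(obs / 255.0))  # scale uint8 pixels to [0, 1]
        return self.actor(h), self.critic(h)
```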
⚙️ Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| Learning Rate | 2.5e-4 | Adam optimizer learning rate |
| Discount Factor | 0.99 | Reward discount factor |
| Parallel Envs | 8 | Number of vectorized environments |
| Steps per Rollout | 128 | Steps collected per environment |
| Minibatches | 4 | Number of minibatches per update |
| PPO Epochs | 4 | Training epochs per rollout |
| Clip Range | 0.1 | PPO clipping parameter |
| Entropy Coeff | 0.01 | Entropy bonus coefficient |
| Value Coeff | 0.5 | Value loss coefficient |
| GAE Lambda | 0.95 | Advantage estimation parameter |
🏃‍♂️ Quick Start
Prerequisites
```bash
pip install torch gymnasium ale-py stable-baselines3 wandb opencv-python imageio
```
Training
```bash
# Train on Pong
python atari-pong.py

# Train on Boxing
python boxing-atari.py

# Train on Bowling
python atari-bowling.py

# Run SB3 benchmark
python sb3-atari-benchmark.py
```
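For reference, a hedged sketch of what the SB3 benchmark could look like, using Stable Baselines3's built-in Atari pipeline with the hyperparameters from the table above; the actual `sb3-atari-benchmark.py` may be set up differently.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# SB3's standard Atari preprocessing plus 4-frame stacking
envs = make_atari_env("PongNoFrameskip-v4", n_envs=8, seed=0)
envs = VecFrameStack(envs, n_stack=4)

model = PPO(
    "CnnPolicy", envs,
    learning_rate=2.5e-4, n_steps=128, batch_size=256,  # 8 envs * 128 steps / 4 minibatches
    n_epochs=4, gamma=0.99, gae_lambda=0.95,
    clip_range=0.1, ent_coef=0.01, vf_coef=0.5, verbose=1,
)
model.learn(total_timesteps=10_000_000)
```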
Configuration
Edit the `Config` class in each file to modify hyperparameters:
- `total_timesteps`: Total training steps (default: 10M)
- `env_id`: Environment name
- `lr`: Learning rate
- `num_envs`: Number of parallel environments
- `use_wandb`: Enable Weights & Biases logging
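The exact fields vary by script, but a `Config` class along these lines would match the options listed above, with defaults taken from the hyperparameter table; this is a hedged sketch, not the repo's code.

```python
from dataclasses import dataclass

@dataclass
class Config:
    env_id: str = "PongNoFrameskip-v4"
    total_timesteps: int = 10_000_000  # 10M steps
    lr: float = 2.5e-4
    num_envs: int = 8
    use_wandb: bool = False            # enable Weights & Biases logging
```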
📊 Results
Training Videos
Training videos are saved in the `images/` directory:
- `pong.mp4` - Trained Pong agent gameplay
- `boxing.mp4` - Trained Boxing agent gameplay
- `bowling.mp4` - Trained Bowling agent gameplay
Detailed Training Reports
📈 PPO Atari Pong - Training Report
🎳 PPO Atari Bowling - Training Report
🥊 PPO Atari Boxing - Training Report
The W&B reports include:
- Training curves and learning progression
- Hyperparameter sweeps and optimization
- Performance metrics and comparisons
- Loss functions and gradient analysis
- Real-time training monitoring
📈 Performance
The implementation typically achieves:
- Pong: 15-20 average reward after 10M steps
- Boxing: 80-95 average reward after 10M steps
- Bowling: 40-60 average reward after 10M steps
🔧 Technical Details
Environment Preprocessing
- Observation: 210x160x3 RGB frames → 64x64 grayscale
- Frame Stacking: 4 consecutive frames
- Reward Clipping: Rewards clipped to [-1, 1]
- Life Management: Episode ends on life loss
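Done by hand with OpenCV, the per-frame preprocessing amounts to the following sketch (for illustration only; the scripts may rely on Gymnasium/SB3 wrappers instead):

```python
from collections import deque

import cv2
import numpy as np

def preprocess(frame: np.ndarray) -> np.ndarray:
    """210x160x3 RGB frame -> 64x64 grayscale, uint8."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (64, 64), interpolation=cv2.INTER_AREA)

# Frame stacking: the observation is the 4 most recent processed frames
frames = deque([np.zeros((64, 64), dtype=np.uint8)] * 4, maxlen=4)

def observe(raw_frame: np.ndarray) -> np.ndarray:
    frames.append(preprocess(raw_frame))
    return np.stack(frames, axis=0)  # shape (4, 64, 64)
```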
Training Loop
- Rollout Collection: Collect trajectories from vectorized environments
- Advantage Estimation: Compute GAE advantages
- Policy Update: Multiple PPO epochs with minibatch updates
- Value Update: Train critic with clipped value loss
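A minimal sketch of the advantage-estimation step, assuming rollout tensors of shape (num_steps, num_envs) and the convention that dones[t] = 1 when the episode ends at step t; gamma and lambda follow the hyperparameter table.

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(last_value)
    for t in reversed(range(num_steps)):
        next_value = last_value if t == num_steps - 1 else values[t + 1]
        not_done = 1.0 - dones[t]                 # zero out bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        last_gae = delta + gamma * lam * not_done * last_gae
        advantages[t] = last_gae
    returns = advantages + values                 # targets for the value function
    return advantages, returns
```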
🎯 Key Insights
- Vectorization: Parallel environments dramatically speed up data collection
- Frame Stacking: Provides temporal information for decision making
- Proper Preprocessing: Essential for stable Atari training
- Clipping: Prevents destructive policy updates
- GAE: Reduces variance in advantage estimation
🔍 Monitoring
- Tensorboard: Real-time training metrics
- Weights & Biases: Experiment tracking and visualization
- Video Recording: Periodic agent gameplay videos
- Console Logging: Episode rewards and training progress
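The metric names below are illustrative rather than the repo's exact keys, but the logging calls themselves are the standard TensorBoard and W&B APIs:

```python
import wandb
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/ppo-atari")
wandb.init(project="ppo-atari", config={"env_id": "PongNoFrameskip-v4", "lr": 2.5e-4})

global_step, episodic_return = 100_000, 18.5  # example values
writer.add_scalar("charts/episodic_return", episodic_return, global_step)
wandb.log({"charts/episodic_return": episodic_return}, step=global_step)
```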
📚 References
- Proximal Policy Optimization
- Human-level control through deep reinforcement learning
- Generalized Advantage Estimation
💡 Tips for Success
- Sufficient Training Time: Atari games require 10M+ steps
- Proper Preprocessing: Use standard Atari wrappers
- Stable Learning Rate: 2.5e-4 works well for most games
- Monitor Training: Watch for policy collapse or instability
- Vectorization: Use multiple environments for efficiency
Source Code
📁 GitHub Repository: Atari (PPO Atari)
View the complete implementation, training scripts, and documentation on GitHub.