Atari
Implementation of PPO reinforcement learning agents for classic Atari games
Technical Details
- Framework: PyTorch
- Environment: Atari
- Category: Other
This directory contains Proximal Policy Optimization (PPO) implementations for training agents on classic Atari games using PyTorch and Gymnasium.
Overview
This implementation features vectorized PPO training on Atari environments with proper preprocessing, frame stacking, and environment wrappers. The code includes both custom implementations and Stable Baselines3 benchmarks for comparison.
Trained Agents in Action
Pong

PPO agent playing Pong after 10M training steps - achieving consistent wins against the built-in AI opponent
Bowling

PPO agent playing Bowling after 10M training steps - demonstrating learned bowling strategies and consistent scoring
Boxing

PPO agent playing Boxing after 10M training steps - showing aggressive fighting strategies and effective combat techniques
Supported Games
- Pong (`PongNoFrameskip-v4`) - Classic paddle game
- Boxing (`BoxingNoFrameskip-v4`) - Fighting game
- Bowling (`BowlingNoFrameskip-v4`) - Bowling simulation
Files
- `atari-pong.py` - PPO implementation for Pong
- `boxing-atari.py` - PPO implementation for Boxing
- `atari-bowling.py` - PPO implementation for Bowling
- `sb3-atari-benchmark.py` - Stable Baselines3 benchmark comparison
- `images/` - Contains training videos and results
Features
Core PPO Implementation
- Vectorized Training: 8 parallel environments for efficient data collection
- Generalized Advantage Estimation (GAE): λ = 0.95 for the bias-variance tradeoff
- Clipped Surrogate Objective: Prevents destructive policy updates (see the loss sketch below)
- Value Function Clipping: Stabilizes critic training
- Gradient Clipping: Prevents exploding gradients (max_grad_norm = 0.5)
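Taken together, these pieces form a single optimization objective. Below is a minimal PyTorch sketch of how the clipped surrogate, clipped value loss, and entropy bonus might be combined; the tensor names and the helper itself are illustrative, not lifted from the repository code.

```python
import torch

def ppo_loss(new_logprobs, old_logprobs, advantages,
             new_values, old_values, returns, entropy,
             clip_range=0.1, vf_coef=0.5, ent_coef=0.01):
    """Clipped-surrogate PPO loss with value clipping and an entropy bonus."""
    # Probability ratio between the updated policy and the rollout policy
    ratio = (new_logprobs - old_logprobs).exp()

    # Clipped surrogate objective: take the pessimistic (minimum) bound
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value clipping: limit how far the new value estimate can move per update
    values_clipped = old_values + torch.clamp(new_values - old_values,
                                              -clip_range, clip_range)
    value_loss = 0.5 * torch.max((new_values - returns) ** 2,
                                 (values_clipped - returns) ** 2).mean()

    # Entropy bonus encourages exploration
    return policy_loss + vf_coef * value_loss - ent_coef * entropy.mean()
```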
Atari-Specific Features
- Frame Preprocessing: Grayscale conversion and resizing to 64x64
- Frame Stacking: 4 consecutive frames as input
- Standard Atari Wrappers:
  - `NoopResetEnv`: Random no-op actions at episode start
  - `MaxAndSkipEnv`: Frame skipping and max pooling
  - `EpisodicLifeEnv`: Treats life loss as episode end
  - `FireResetEnv`: Automatically fires at episode start
  - `ClipRewardEnv`: Clips rewards to [-1, 1]
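One plausible way to compose these wrappers and feed them into a vectorized environment is sketched below. The repository's exact ordering and resize target may differ; the wrapper names assume a pre-1.0 Gymnasium release plus Stable Baselines3's `atari_wrappers` module, and `make_env` is a hypothetical helper.

```python
import ale_py  # noqa: F401  (registers the ALE Atari environments on import)
import gymnasium as gym
from gymnasium.wrappers import FrameStack, GrayScaleObservation, ResizeObservation
from stable_baselines3.common.atari_wrappers import (
    ClipRewardEnv, EpisodicLifeEnv, FireResetEnv, MaxAndSkipEnv, NoopResetEnv)

def make_env(env_id="PongNoFrameskip-v4", size=64):
    def thunk():
        env = gym.make(env_id)
        env = NoopResetEnv(env, noop_max=30)       # random no-ops at reset
        env = MaxAndSkipEnv(env, skip=4)           # frame skip + max pooling
        env = EpisodicLifeEnv(env)                 # life loss ends the episode
        if "FIRE" in env.unwrapped.get_action_meanings():
            env = FireResetEnv(env)                # press FIRE to start
        env = ClipRewardEnv(env)                   # clip rewards to [-1, 1]
        env = GrayScaleObservation(env)            # RGB -> grayscale
        env = ResizeObservation(env, (size, size))
        env = FrameStack(env, 4)                   # stack 4 consecutive frames
        return env
    return thunk

# 8 parallel environments for vectorized rollout collection
envs = gym.vector.SyncVectorEnv([make_env() for _ in range(8)])
```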
Network Architecture
CNN Feature Extractor:
- Conv2d(4, 32, kernel=8, stride=4) + ReLU
- Conv2d(32, 64, kernel=4, stride=2) + ReLU
- Conv2d(64, 64, kernel=3, stride=1) + ReLU
- Flatten
- Linear(64*7*7, 512) + ReLU
Actor Head: Linear(512, action_space)
Critic Head: Linear(512, 1)
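A rough PyTorch sketch of this architecture follows. Note that the 64*7*7 flatten size corresponds to the standard 84x84 Atari input; the sketch infers the flatten size from a dummy forward pass so it stays consistent with whatever resize target the wrappers actually use. The class name and defaults are illustrative.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared CNN trunk with separate actor and critic heads."""
    def __init__(self, n_actions, in_channels=4, input_size=84):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened feature size from a dummy forward pass
        with torch.no_grad():
            n_flat = self.features(
                torch.zeros(1, in_channels, input_size, input_size)).shape[1]
        self.trunk = nn.Sequential(nn.Linear(n_flat, 512), nn.ReLU())
        self.actor = nn.Linear(512, n_actions)   # policy logits
        self.critic = nn.Linear(512, 1)          # state-value estimate

    def forward(self, x):
        h = self.trunk(self.features(x / 255.0))  # scale pixels to [0, 1]
        return self.actor(h), self.critic(h)
```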
Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| Learning Rate | 2.5e-4 | Adam optimizer learning rate |
| Discount Factor | 0.99 | Reward discount factor |
| Parallel Envs | 8 | Number of vectorized environments |
| Steps per Rollout | 128 | Steps collected per environment |
| Minibatches | 4 | Number of minibatches per update |
| PPO Epochs | 4 | Training epochs per rollout |
| Clip Range | 0.1 | PPO clipping parameter |
| Entropy Coeff | 0.01 | Entropy bonus coefficient |
| Value Coeff | 0.5 | Value loss coefficient |
| GAE Lambda | 0.95 | Advantage estimation parameter |
Quick Start
Prerequisites
pip install torch gymnasium ale-py stable-baselines3 wandb opencv-python imageio
Training
# Train on Pong
python atari-pong.py
# Train on Boxing
python boxing-atari.py
# Train on Bowling
python atari-bowling.py
# Run SB3 benchmark
python sb3-atari-benchmark.py
Configuration
Edit the `Config` class in each file to modify hyperparameters:
- `total_timesteps`: Total training steps (default: 10M)
- `env_id`: Environment name
- `lr`: Learning rate
- `num_envs`: Number of parallel environments
- `use_wandb`: Enable Weights & Biases logging
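For illustration, such a `Config` class could be a plain dataclass mirroring the hyperparameter table above; the exact field names and layout in the repository may differ.

```python
from dataclasses import dataclass

@dataclass
class Config:
    env_id: str = "PongNoFrameskip-v4"   # environment name
    total_timesteps: int = 10_000_000    # total training steps
    lr: float = 2.5e-4                   # Adam learning rate
    num_envs: int = 8                    # parallel environments
    num_steps: int = 128                 # steps per rollout per environment
    num_minibatches: int = 4
    ppo_epochs: int = 4
    gamma: float = 0.99                  # discount factor
    gae_lambda: float = 0.95
    clip_range: float = 0.1
    ent_coef: float = 0.01
    vf_coef: float = 0.5
    max_grad_norm: float = 0.5
    use_wandb: bool = False              # enable Weights & Biases logging
```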
Results
Training Videos
Training videos are saved in the `images/` directory:
- `pong.mp4` - Trained Pong agent gameplay
- `boxing.mp4` - Trained Boxing agent gameplay
- `bowling.mp4` - Trained Bowling agent gameplay
Detailed Training Reports
- PPO Atari Pong - Training Report
- PPO Atari Bowling - Training Report
- PPO Atari Boxing - Training Report
The W&B reports include:
- Training curves and learning progression
- Hyperparameter sweeps and optimization
- Performance metrics and comparisons
- Loss functions and gradient analysis
- Real-time training monitoring
Performance
The implementation typically achieves:
- Pong: 15-20 average reward after 10M steps
- Boxing: 80-95 average reward after 10M steps
- Bowling: 40-60 average reward after 10M steps
Technical Details
Environment Preprocessing
- Observation: 210x160x3 RGB frames → 64x64 grayscale
- Frame Stacking: 4 consecutive frames
- Reward Clipping: Rewards clipped to [-1, 1]
- Life Management: Episode ends on life loss
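As a quick check of this pipeline (reusing the hypothetical `make_env` helper from the wrapper sketch above), the stacked observation shape can be inspected directly:

```python
import numpy as np

env = make_env("PongNoFrameskip-v4")()   # single preprocessed environment
obs, _ = env.reset()
print(np.asarray(obs).shape)             # e.g. (4, 64, 64): 4 stacked grayscale frames
```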
Training Loop
- Rollout Collection: Collect trajectories from vectorized environments
- Advantage Estimation: Compute GAE advantages
- Policy Update: Multiple PPO epochs with minibatch updates
- Value Update: Train critic with clipped value loss
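A compact sketch of the advantage-estimation step is shown below, assuming rollout tensors of shape (num_steps, num_envs) and the convention that `dones[t]` is 1 when the episode ended at step t; variable names are illustrative.

```python
import torch

def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over a rollout of shape (T, num_envs)."""
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(next_value)
    for t in reversed(range(T)):
        # Bootstrap from the next state's value, masked at episode boundaries
        next_values = next_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_values * next_nonterminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_nonterminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values  # regression targets for the critic
    return advantages, returns
```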
Key Insights
- Vectorization: Parallel environments dramatically speed up data collection
- Frame Stacking: Provides temporal information for decision making
- Proper Preprocessing: Essential for stable Atari training
- Clipping: Prevents destructive policy updates
- GAE: Reduces variance in advantage estimation
Monitoring
- TensorBoard: Real-time training metrics
- Weights & Biases: Experiment tracking and visualization
- Video Recording: Periodic agent gameplay videos
- Console Logging: Episode rewards and training progress
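As a minimal sketch of the video-recording point, one evaluation rollout can be rendered and written with imageio (installed by the prerequisites; MP4 output also needs the imageio ffmpeg plugin). The `agent` and the single preprocessed evaluation environment here are hypothetical stand-ins for the repository's own objects.

```python
import imageio
import numpy as np
import torch

def record_episode(agent, env, path="images/eval.mp4", fps=30):
    """Roll out one greedy episode and write the rendered frames to an MP4.

    `env` is a single preprocessed Atari env created with render_mode="rgb_array";
    `agent` maps a (1, 4, H, W) observation tensor to (logits, value).
    """
    obs, _ = env.reset()
    frames, done = [], False
    while not done:
        frames.append(env.render())
        with torch.no_grad():
            x = torch.as_tensor(np.asarray(obs), dtype=torch.float32).unsqueeze(0)
            logits, _ = agent(x)
        action = int(logits.argmax(dim=-1))       # greedy action for evaluation
        obs, _, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
    imageio.mimsave(path, frames, fps=fps)
```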
References
- Proximal Policy Optimization Algorithms (Schulman et al., 2017)
- Human-level control through deep reinforcement learning (Mnih et al., 2015)
- High-Dimensional Continuous Control Using Generalized Advantage Estimation (Schulman et al., 2016)
Tips for Success
- Sufficient Training Time: Atari games require 10M+ steps
- Proper Preprocessing: Use standard Atari wrappers
- Stable Learning Rate: 2.5e-4 works well for most games
- Monitor Training: Watch for policy collapse or instability
- Vectorization: Use multiple environments for efficiency
Source Code
GitHub Repository: Atari (PPO Atari)
View the complete implementation, training scripts, and documentation on GitHub.