PPO
Implementation of PPO reinforcement learning algorithm
Technical Details
- Framework: PyTorch
- Environments: Gymnasium (CartPole, LunarLander, Pendulum, BipedalWalker, CarRacing) and ViZDoom
- Category: Other
This directory contains implementations of the Proximal Policy Optimization (PPO) algorithm for various environments in PyTorch.
Overview
PPO is a state-of-the-art policy gradient method that combines the stability of trust region methods with the simplicity and efficiency of first-order optimization. It addresses the issue of choosing the right step size when optimizing policies by introducing a clipped surrogate objective function.
Key features of this implementation:
- Clipped surrogate objective function for stable policy updates
- Actor-Critic architecture with value function for advantage estimation
- Generalized Advantage Estimation (GAE) for reduced variance
- Configurable hyperparameters for different environments
- Two implementations: separate actor-critic networks and unified network
Implementations
This repository includes two main PPO implementations:
- Standard PPO (train.py): Uses separate networks for the actor (policy) and critic (value function). Applied to the LunarLander environment.
- Unified PPO (train_unified.py): Uses a single network with shared layers feeding separate policy and value heads.
Environments
This implementation has been tested on:
- CartPole-v1: A classic control task where a pole is attached to a cart that moves along a frictionless track.
- LunarLander-v3: A more complex environment where an agent must land a lunar module on a landing pad.
- Pendulum-v1: A continuous control task where the agent learns to balance a pendulum by applying torque.
- BipedalWalker-v3: A continuous control environment where the agent learns to walk forward using bipedal locomotion.
- CarRacing-v3: A continuous control environment where the agent learns to drive a car around a track from a top-down view.
- ViZDoom Basic: A 3D first-person shooter scenario where the agent learns to aim and shoot a monster in a simple room.
For MuJoCo environments (HalfCheetah, Humanoid, Ant, etc.), see the dedicated MuJoCo folder with specialized implementations and detailed documentation.
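All of these environments are available through Gymnasium. As a quick orientation (not the exact setup used by the training scripts, which may add wrappers and seeding), they can be created like this:

```python
import gymnasium as gym

# Discrete-action environment (categorical policy head)
env = gym.make("LunarLander-v3")
print(env.action_space)       # Discrete(4)

# Continuous-action environment (Gaussian policy head)
env = gym.make("Pendulum-v1")
print(env.action_space)       # Box(-2.0, 2.0, (1,), float32)

# Image-based environment (convolutional feature extractor)
env = gym.make("CarRacing-v3", continuous=True)
print(env.observation_space)  # Box(0, 255, (96, 96, 3), uint8)
```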
Configuration
The implementation uses a Config class that specifies hyperparameters for training:
- exp_name: Name of the experiment
- seed: Random seed for reproducibility
- env_id: ID of the Gymnasium environment
- episodes: Number of episodes to train
- lr / learning_rate: Learning rate for the optimizer
- gamma: Discount factor
- clip_value: PPO clipping parameter (epsilon)
- PPO_EPOCHS: Number of optimization epochs per batch
- ENTROPY_COEFF: Coefficient for the entropy bonus
- max_steps: Maximum number of steps per episode (for train.py)
Additional important parameters:
- VALUE_COEFF: Coefficient for the value function loss
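For reference, a Config of this shape could be written as a dataclass like the sketch below. The field names follow the list above; the default values are illustrative assumptions, not the repository's actual settings.

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Experiment bookkeeping
    exp_name: str = "ppo_lunarlander"   # illustrative default
    seed: int = 42
    env_id: str = "LunarLander-v3"

    # Training schedule
    episodes: int = 2000
    max_steps: int = 1000               # per-episode step cap (train.py)

    # Optimization
    learning_rate: float = 3e-4         # aka lr
    gamma: float = 0.99                 # discount factor
    clip_value: float = 0.2             # PPO clipping epsilon
    PPO_EPOCHS: int = 4                 # optimization epochs per batch
    ENTROPY_COEFF: float = 0.01         # entropy bonus weight
    VALUE_COEFF: float = 0.5            # value-loss weight
```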
Algorithm Details
PPO works by:
- Collecting Experience: The agent interacts with the environment to collect trajectories.
- Computing Advantages: Generalized Advantage Estimation (GAE) is used to estimate the advantage function.
- Policy Update: The policy is updated using the clipped surrogate objective:
L^CLIP(θ) = E_t[ min(r_t(θ) * A_t, clip(r_t(θ), 1-ε, 1+ε) * A_t) ]
where r_t(θ) is the ratio of new to old policy probabilities, A_t is the advantage, and ε is the clip parameter.
- Value Function Update: The value function is updated to better predict returns.
The clipping mechanism prevents excessively large policy updates, improving stability without the computational overhead of trust-region methods like TRPO.
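To make these steps concrete, here is a minimal PyTorch sketch of GAE and the clipped surrogate loss. Function and variable names (compute_gae, ppo_loss, gae_lambda, etc.) are illustrative and do not necessarily match the repository's code.

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards, dones: tensors of shape [T]; values: tensor of shape [T + 1]
    (the extra entry is the bootstrap value for the final state).
    """
    advantages = torch.zeros_like(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        last_adv = delta + gamma * gae_lambda * not_done * last_adv
        advantages[t] = last_adv
    returns = advantages + values[:-1]
    return advantages, returns


def ppo_loss(new_log_probs, old_log_probs, advantages, new_values, returns,
             entropy, clip_value=0.2, value_coeff=0.5, entropy_coeff=0.01):
    """Clipped surrogate objective plus value and entropy terms."""
    ratio = torch.exp(new_log_probs - old_log_probs)        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_value, 1 + clip_value) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()     # maximize -> negate
    value_loss = (new_values - returns).pow(2).mean()
    return policy_loss + value_coeff * value_loss - entropy_coeff * entropy.mean()
```

During training, this loss would be minimized for PPO_EPOCHS passes over each collected batch, weighted by VALUE_COEFF and ENTROPY_COEFF from the configuration.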
Architecture
Standard PPO (train.py)
- Actor Network: Maps states to action probabilities
- Critic Network: Estimates the value function
Unified PPO (train_unified.py)
- Shared Layers: Extract features from the state
- Policy Head: Maps features to action probabilities
- Value Head: Maps features to state value estimates
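As an illustration of the two layouts, the following sketch shows a separate actor/critic pair and a unified network for a discrete-action task. Layer widths and activations are assumptions, not the exact architectures in train.py or train_unified.py.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Standard PPO: policy network mapping states to action logits."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))


class Critic(nn.Module):
    """Standard PPO: value network estimating V(s)."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)


class UnifiedActorCritic(nn.Module):
    """Unified PPO: shared trunk with separate policy and value heads."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        features = self.shared(obs)
        dist = torch.distributions.Categorical(logits=self.policy_head(features))
        value = self.value_head(features).squeeze(-1)
        return dist, value
```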
Logging and Monitoring
Training progress is logged using:
- TensorBoard: Local visualization of training metrics
- Weights & Biases (WandB): Cloud-based experiment tracking
- Video Capture: Records videos of agent performance at intervals
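A rough sketch of how these three pieces are typically wired together with Gymnasium and PyTorch follows; the project name, folder paths, metric names, and recording interval are placeholders, not the repository's actual values.

```python
import gymnasium as gym
import wandb
from torch.utils.tensorboard import SummaryWriter

run_name = "ppo_lunarlander_seed42"                 # placeholder name
writer = SummaryWriter(f"runs/{run_name}")          # TensorBoard logs
wandb.init(project="ppo", name=run_name)            # optional cloud tracking

# Record a video every 100 episodes (placeholder folder and trigger)
env = gym.make("LunarLander-v3", render_mode="rgb_array")
env = gym.wrappers.RecordVideo(
    env,
    video_folder=f"videos/{run_name}",
    episode_trigger=lambda ep: ep % 100 == 0,
)

# Inside the training loop:
# writer.add_scalar("charts/episodic_return", episode_return, global_step)
# wandb.log({"episodic_return": episode_return}, step=global_step)
```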
Results
LunarLander
The following image shows the training performance on the LunarLander environment:
Cartpole
The following image shows the training performance on the CartPole environment:
ViZDoom Basic
The following image shows the training performance on the ViZDoom Basic environment:
Agent gameplay demonstration:
ViZDoom Defend the Center
PPO has been successfully applied to the ViZDoom Defend the Center environment, a challenging 3D first-person shooter task where the agent must defend against enemies approaching from all directions.
Agent gameplay demonstration:
Detailed training results and analysis can be found in this comprehensive report: VizDoom Defend The Center PPO - WandB Report
Car Racing
The following image shows the training performance on the CarRacing-v3 environment:
PPO has been successfully applied to the CarRacing-v3 environment, a challenging continuous control task where the agent must learn to drive a car around a randomly generated track. The environment features:
- Continuous action space: Steering, acceleration, and braking
- High-dimensional visual input: 96x96 RGB images from a top-down view
- Complex dynamics: Realistic car physics and track generation
Detailed training results and analysis can be found in this comprehensive report: PPO on Car Racing v3 - WandB Report
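The 96x96 RGB observations call for a convolutional encoder in front of the policy and value heads. The sketch below shows one plausible layout; channel counts, kernel sizes, and the feature dimension are assumptions, not the repository's exact network.

```python
import torch
import torch.nn as nn

class PixelEncoder(nn.Module):
    """Convolutional feature extractor for 96x96x3 CarRacing frames."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),   # 96 -> 23
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 23 -> 10
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 10 -> 8
            nn.Flatten(),
        )
        self.fc = nn.Sequential(nn.Linear(64 * 8 * 8, feature_dim), nn.ReLU())

    def forward(self, obs):
        # obs: [batch, 96, 96, 3] uint8 -> [batch, 3, 96, 96] float in [0, 1]
        x = obs.permute(0, 3, 1, 2).float() / 255.0
        return self.fc(self.conv(x))
```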
Pendulum
PPO has been successfully applied to the Pendulum-v1 environment, a classic continuous control task where the agent must learn to balance a pendulum by applying torque. The environment features:
- Continuous action space: Single continuous action (torque) between -2 and 2
- Continuous state space: 3-dimensional observation (cos(θ), sin(θ), angular velocity)
- Challenging dynamics: The agent must learn to swing up and balance the pendulum at the upright position
The following shows the training performance and agent behavior:
Detailed training results and analysis can be found in this comprehensive report: PPO on Pendulum-v1 - WandB Report
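For continuous-action tasks such as Pendulum, the policy head typically outputs a diagonal Gaussian over actions rather than categorical logits. A minimal sketch, assuming a state-independent log-std parameter and tanh scaling of the mean to the [-2, 2] torque range (this is one common choice, not necessarily the repository's):

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy for continuous actions, e.g. Pendulum torque."""
    def __init__(self, obs_dim=3, act_dim=1, hidden=64, action_scale=2.0):
        super().__init__()
        self.action_scale = action_scale
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        # State-independent log standard deviation (a common PPO choice)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = torch.tanh(self.mean_net(obs)) * self.action_scale  # mean in [-2, 2]
        std = self.log_std.exp()
        return torch.distributions.Normal(mean, std)

# Usage inside rollout collection (sampled actions may still need clipping
# to the environment's action bounds before env.step):
# dist = policy(obs)
# action = dist.sample()
# log_prob = dist.log_prob(action).sum(-1)   # summed over action dimensions
```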
BipedalWalker
PPO has been successfully applied to the BipedalWalker-v3 environment, a challenging continuous control task where the agent must learn to walk forward using bipedal locomotion. The environment features:
- Continuous action space: 4-dimensional continuous actions controlling hip and knee torques for both legs
- Continuous state space: 24-dimensional observation including hull angle, angular velocity, leg positions, and velocities
- Challenging dynamics: The agent must learn to coordinate multiple joints to achieve stable walking while maintaining balance
The following shows the training performance and agent behavior:
Detailed training results and analysis can be found in this comprehensive report: PPO on BipedalWalker-v3 - WandB Report
Dependencies
- PyTorch
- Gymnasium
- NumPy
- WandB (optional, for experiment tracking)
- TensorBoard
- OpenCV
- Tqdm
References
- CleanRL - Inspiration for code structure and implementation style
Source Code
📁 GitHub Repository: PPO
View the complete implementation, training scripts, and documentation on GitHub.