REINFORCE

Category: Policy Gradient
Framework: PyTorch
Environment: Gymnasium
Created: August 21, 2025

Implementation of the REINFORCE reinforcement learning algorithm

Overview

This repository contains an implementation of the REINFORCE algorithm (also known as Monte Carlo Policy Gradient), a foundational policy gradient method in reinforcement learning. The implementation is built with PyTorch and supports training on various Gymnasium environments, with a focus on the CartPole-v1 environment.

Algorithm Description

REINFORCE is a policy gradient method that directly optimizes a policy without using a value function. It belongs to the class of Monte Carlo methods because it uses complete episode returns to update the policy. The key features of this implementation include:

  • Monte Carlo Policy Gradient: Updates policy parameters using complete episode returns
  • Policy Network: Neural network that maps states to action probabilities
  • Return Calculation: Computes discounted returns from rewards
  • Return Normalization: Normalizes returns to reduce variance in updates
  • Gradient Monitoring: Tracks parameter and gradient statistics during training

Algorithm Steps

  1. Initialize a parameterized policy π(a|s; θ)
  2. For each episode:
    • Generate a complete trajectory following the current policy
    • For each step in the trajectory:
      • Calculate the discounted return from that step onwards
    • Update policy parameters using gradient ascent:
      • θ ← θ + α ∇θ log π(aₜ | sₜ; θ) Gₜ
  3. Repeat until convergence (a code sketch of one episode-and-update cycle follows)
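
Condensed into PyTorch, one episode-and-update cycle of the loop above might look like the following. This is a minimal sketch rather than the repository's exact code; it assumes a policy module that maps a state tensor to action probabilities and a Gymnasium environment env.

import torch

def reinforce_episode(env, policy, optimizer, gamma=0.99):
    # Roll out one complete trajectory under the current policy
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted return G_t for each step, computed backwards over the episode
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalization

    # Gradient ascent on sum_t log pi(a_t | s_t; theta) * G_t,
    # implemented as descent on the negated objective
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)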

Implementation Details

Network Architecture

The policy network consists of the following layers (a code sketch appears after the list):

  • Input layer matching state space dimensions
  • Two hidden layers (32 nodes each) with ReLU activation
  • One hidden layer (16 nodes) with ReLU activation
  • Output layer matching action space dimensions with softmax activation
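
A hedged PyTorch sketch of this architecture (only the layer sizes and activations come from the list above; the class name and constructor are illustrative):

import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, action_dim),
            nn.Softmax(dim=-1),  # action probabilities
        )

    def forward(self, state):
        return self.net(state)

For CartPole-v1, which has a 4-dimensional state space and 2 discrete actions, this would be instantiated as PolicyNetwork(state_dim=4, action_dim=2).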

Key Features

  • Policy Network: Maps states to action probabilities
  • Stochastic Action Selection: Samples actions from a categorical distribution over action probabilities
  • Return Calculation: Computes discounted returns for each step
  • Return Normalization: Reduces variance in policy updates
  • Gradient and Parameter Monitoring: Tracks parameter and gradient statistics with WandB (see the sketch after this list)
  • Evaluation: Periodically evaluates policy performance
  • Video Recording: Captures agent behavior for visualization
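
The gradient and parameter monitoring in the list above could be implemented as a small helper that logs per-parameter norms to WandB after each backward pass. This is a hypothetical sketch, not the repository's code; it assumes wandb.init(...) has already been called:

import wandb

def log_grad_stats(policy, step):
    # Record the L2 norm of every parameter tensor and its gradient
    stats = {}
    for name, param in policy.named_parameters():
        stats[f"params/{name}_norm"] = param.detach().norm().item()
        if param.grad is not None:
            stats[f"grads/{name}_norm"] = param.grad.detach().norm().item()
    wandb.log(stats, step=step)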

Usage

Prerequisites

  • Python 3.8+
  • PyTorch
  • Gymnasium
  • Stable-Baselines3 (for ReplayBuffer utility)
  • Weights & Biases (for logging)
  • TensorBoard
  • tqdm, numpy, imageio, opencv-python (cv2)

Configuration

The Config class contains all hyperparameters and settings:

class Config:
    # Experiment settings
    exp_name = "DQN-CartPole"  # Can be renamed to "REINFORCE-CartPole"
    seed = 42
    env_id = "CartPole-v1"
    episodes = 2000  # Number of episodes to train
    # Training parameters
    learning_rate = 2e-3
    gamma = 0.99  # Discount factor
    # Logging & saving
    capture_video = True
    save_model = True
    use_wandb = True
    wandb_project = "cleanRL"
    wandb_entity = ""  # Your WandB username/team
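
A short sketch of how these settings might be wired up at start-up; the wiring below is illustrative and not taken from the repository:

import random

import gymnasium as gym
import numpy as np
import torch

cfg = Config()

# Seed every source of randomness for reproducibility
random.seed(cfg.seed)
np.random.seed(cfg.seed)
torch.manual_seed(cfg.seed)

# rgb_array rendering is needed if episodes are recorded to video
env = gym.make(cfg.env_id, render_mode="rgb_array" if cfg.capture_video else None)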

License

This project is open source and available under the MIT License.

Source Code

📁 GitHub Repository: Reinforce

View the complete implementation, training scripts, and documentation on GitHub.