GRPO
Overview
From-scratch implementation of Group Relative Policy Optimization (GRPO), the RL training objective introduced in DeepSeek-R1. Unlike PPO, which trains a separate value network (critic) as its baseline, GRPO estimates the baseline from the mean reward of a group of outputs sampled for the same prompt. This removes the critic entirely and makes RL training of LLMs significantly cheaper.
Algorithm
For each prompt, sample G outputs from the current policy. The group advantage for output i is:
A_i = (r_i - mean(r)) / std(r)
The policy is then updated with a clipped surrogate objective identical to PPO’s, but using group-relative advantages instead of GAE.
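The group-relative advantage above can be sketched in a few lines of NumPy (the `eps` term is an assumption added to guard against a zero standard deviation when all rewards in a group are equal):

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: A_i = (r_i - mean(r)) / std(r)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: a group of G = 4 outputs sampled for one prompt,
# two of which received reward 1 and two reward 0
adv = group_advantages([1.0, 0.0, 1.0, 0.0])
# adv is [1, -1, 1, -1]: above-average outputs get positive advantage
```

Note that normalization is per group (per prompt), not across the whole batch, so each prompt's outputs are compared only against each other.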
Key Properties
- No value network: baseline is the within-group mean reward
- Group size G: controls variance of the advantage estimate
- KL penalty: constrains the policy from drifting too far from a reference model
- Clipped surrogate: same epsilon-clip as PPO for stability
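The properties above combine into a single loss. Here is a minimal NumPy sketch of one variant: a PPO-style clipped surrogate on group-relative advantages plus a KL penalty toward the reference model, using the low-variance estimator exp(d) - d - 1. The values of `clip_eps` and `beta` are illustrative, not taken from the paper:

```python
import numpy as np

def grpo_loss(logp, logp_old, logp_ref, adv, clip_eps=0.2, beta=0.04):
    """Clipped surrogate with group-relative advantages plus a KL penalty
    that keeps the policy close to the reference model."""
    ratio = np.exp(logp - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * adv, clipped * adv)
    # KL estimator: exp(d) - d - 1 with d = logp_ref - logp (nonnegative, unbiased)
    d = logp_ref - logp
    kl = np.exp(d) - d - 1.0
    return -(surrogate - beta * kl).mean()

lp = np.array([-1.0, -2.0])
loss = grpo_loss(logp=lp, logp_old=lp, logp_ref=lp,
                 adv=np.array([1.0, -1.0]))
# With identical log-probs, ratio = 1 and KL = 0, so loss = -mean(adv) = 0
```

In practice `logp` carries gradients (e.g. a torch tensor) while `logp_old` and `logp_ref` are detached; the NumPy version is only meant to show the arithmetic.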
Implementation
Implemented as a Jupyter notebook (train.ipynb) covering the full GRPO update loop, advantage normalization, and the KL-divergence penalty term.
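To see the full loop end to end without an LLM, here is a self-contained toy where a softmax policy over a 4-armed bandit stands in for the language model (all names and hyperparameters here are illustrative, not from the notebook). It samples a group, normalizes rewards within the group, and takes a policy-gradient step; since old and current policies coincide within a step, the ratio is 1 and clipping is inactive:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reward(a):
    # Toy reward: only arm 3 pays off
    return 1.0 if a == 3 else 0.0

logits = np.zeros(4)           # policy parameters
G, lr, steps = 8, 0.5, 200     # group size, learning rate, update steps

for _ in range(steps):
    probs = softmax(logits)
    acts = rng.choice(4, size=G, p=probs)       # sample a group of G outputs
    r = np.array([reward(a) for a in acts])
    adv = (r - r.mean()) / (r.std() + 1e-8)     # group-relative advantage
    grad = np.zeros(4)
    for a, A in zip(acts, adv):
        onehot = np.eye(4)[a]
        grad += A * (onehot - probs)            # A * grad of log pi(a)
    logits += lr * grad / G

final_probs = softmax(logits)
# The policy concentrates on the rewarding arm (index 3)
```

The same structure carries over to the notebook's LLM setting: groups of sampled completions per prompt, group-normalized advantages, and a clipped, KL-penalized policy update.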
References
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — DeepSeek-AI, 2025