DDPG
Implementation of the DDPG reinforcement learning algorithm
Technical Details
- Framework: PyTorch
- Environment: MuJoCo
- Category: Actor-Critic Methods
This directory contains implementations of the Deep Deterministic Policy Gradient (DDPG) algorithm for various continuous control environments.
Overview
DDPG is an off-policy actor-critic algorithm designed for continuous action spaces. It combines insights from both Deep Q-Networks (DQN) and policy gradient methods to learn policies in high-dimensional, continuous action spaces.
Key features of this implementation:
- Actor-Critic architecture with separate target networks
- Experience replay buffer for stable learning
- Soft target network updates using Polyak averaging
- Exploration using Ornstein-Uhlenbeck noise process
- Support for different continuous control environments
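Two of the features above, Polyak-averaged target updates and Ornstein-Uhlenbeck exploration noise, can be sketched roughly as follows. This is a minimal illustration, not the exact code in this repository; the names soft_update and OUNoise and the noise parameters are assumptions.

```python
import numpy as np
import torch

def soft_update(target_net: torch.nn.Module, source_net: torch.nn.Module, tau: float) -> None:
    """Polyak-average the source network parameters into the target network."""
    with torch.no_grad():
        for target_param, source_param in zip(target_net.parameters(), source_net.parameters()):
            target_param.data.mul_(1.0 - tau).add_(tau * source_param.data)

class OUNoise:
    """Ornstein-Uhlenbeck process for temporally correlated exploration noise."""
    def __init__(self, action_dim: int, mu: float = 0.0, theta: float = 0.15, sigma: float = 0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.full(action_dim, mu, dtype=np.float64)

    def reset(self) -> None:
        self.state[:] = self.mu

    def sample(self) -> np.ndarray:
        # dx = theta * (mu - x) + sigma * N(0, 1), then integrate one step
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state
```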
Environments
This implementation includes support for the following environments:
- Pendulum-v1: A classic control problem where the goal is to balance a pendulum in an upright position.
- BipedalWalker-v3: A more challenging environment where a 2D biped robot must walk forward without falling.
- HalfCheetah-v5: A MuJoCo environment where a 2D cheetah-like robot must run forward as fast as possible.
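For reference, a minimal sketch of creating one of these environments through Gymnasium's standard API; the random-action loop is illustrative only and omits the wrappers (e.g. video recording) used during training.

```python
import gymnasium as gym

# Any of the IDs above works here (Pendulum-v1, BipedalWalker-v3, HalfCheetah-v5).
env = gym.make("Pendulum-v1")
obs, info = env.reset(seed=1)

for _ in range(200):
    action = env.action_space.sample()  # random policy, just to exercise the API
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```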
Configuration
Each implementation includes a Config class that specifies the hyperparameters for training. You can modify these parameters to experiment with different settings:
- exp_name: Name of the experiment
- seed: Random seed for reproducibility
- env_id: ID of the Gymnasium environment
- total_timesteps: Total number of training steps
- learning_rate: Learning rate for the optimizer
- buffer_size: Size of the replay buffer
- gamma: Discount factor
- tau: Soft update coefficient for target networks
- batch_size: Batch size for training
- exploration_fraction: Fraction of total timesteps for exploration
- learning_starts: Number of timesteps before learning starts
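As a rough illustration, such a Config class might look like the dataclass below. The field names follow the list above, but the default values are placeholders rather than the repository's actual settings.

```python
from dataclasses import dataclass

@dataclass
class Config:
    exp_name: str = "ddpg"             # name of the experiment
    seed: int = 1                      # random seed for reproducibility
    env_id: str = "Pendulum-v1"        # Gymnasium environment ID
    total_timesteps: int = 1_000_000   # total number of training steps
    learning_rate: float = 3e-4        # learning rate for the optimizer
    buffer_size: int = 1_000_000       # size of the replay buffer
    gamma: float = 0.99                # discount factor
    tau: float = 0.005                 # soft update coefficient for target networks
    batch_size: int = 256              # batch size for training
    exploration_fraction: float = 0.5  # fraction of total timesteps for exploration
    learning_starts: int = 25_000      # timesteps before learning starts
```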
Architecture
The DDPG implementation includes:
- Actor Network: Determines the best action in a given state
- Critic Network: Evaluates the Q-value of state-action pairs
- Target Networks: Slowly updated copies of both actor and critic for stability
- Replay Buffer: Stores and samples transitions for training
- Noise Process: Adds exploration noise to actions
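A minimal sketch of how the actor and critic networks could be defined in PyTorch; the layer sizes and action-scaling scheme are assumptions, not necessarily those used in this repository.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a deterministic action, scaled to the environment's bounds."""
    def __init__(self, obs_dim: int, act_dim: int, act_limit: float):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),  # output in [-1, 1]
        )
        self.act_limit = act_limit

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.act_limit * self.net(obs)

class Critic(nn.Module):
    """Estimates Q(s, a) from a concatenated state-action pair."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)
```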
Logging and Monitoring
Training progress is logged using:
- TensorBoard: Local visualization of training metrics
- Weights & Biases (WandB): Cloud-based experiment tracking (optional)
- Video Capture: Records videos of agent performance at intervals
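A rough sketch of how the TensorBoard and WandB logging might be wired up; the run name and metric keys are illustrative assumptions.

```python
import time
from torch.utils.tensorboard import SummaryWriter

run_name = f"Pendulum-v1__ddpg__{int(time.time())}"  # illustrative naming scheme
writer = SummaryWriter(f"runs/{run_name}")

# Optional: mirror TensorBoard metrics to Weights & Biases.
# import wandb
# wandb.init(project="ddpg", name=run_name, sync_tensorboard=True)

global_step = 0
episodic_return = 0.0  # placeholder; in training this comes from the environment loop
writer.add_scalar("charts/episodic_return", episodic_return, global_step)
writer.add_scalar("losses/critic_loss", 0.0, global_step)  # placeholder
writer.close()
```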
Dependencies
- PyTorch
- Gymnasium
- NumPy
- Stable-Baselines3 (for the replay buffer)
- WandB (optional, for experiment tracking)
- TensorBoard
- Tqdm
References
- Continuous Control with Deep Reinforcement Learning - Original DDPG paper by Lillicrap et al.
- CleanRL - Inspiration for code structure and implementation style
Source Code
📁 GitHub Repository: DDPG
View the complete implementation, training scripts, and documentation on GitHub.