TD3
Implementation of the TD3 reinforcement learning algorithm
Technical Details
- Framework: PyTorch
- Environment: MuJoCo
- Category: Actor-Critic Methods
This directory contains implementations of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm for various continuous control environments.
Overview
TD3 is an advanced off-policy actor-critic algorithm designed to address the overestimation bias in DDPG. It introduces three critical improvements:
- Twin Critics: Uses two Q-value networks and takes the minimum of their estimates to reduce overestimation bias (see the sketch after this list).
- Delayed Policy Updates: Updates the policy less frequently than the critics to reduce variance.
- Target Policy Smoothing: Adds noise to the target actions to make the algorithm more robust to errors.
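All three ideas meet in the computation of the TD target. The sketch below is illustrative only, not the code from this repository; the function and network names (`actor_target`, `critic1_target`, `critic2_target`) and default values are assumptions.

```python
import torch

def compute_td3_target(reward, next_state, done, actor_target, critic1_target,
                       critic2_target, gamma=0.99, policy_noise=0.2,
                       noise_clip=0.5, max_action=1.0):
    """Illustrative TD3 target: smoothed target action + clipped double-Q minimum."""
    with torch.no_grad():
        # Target policy smoothing: add clipped Gaussian noise to the target action
        next_action = actor_target(next_state)
        noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-max_action, max_action)

        # Twin critics: take the element-wise minimum of the two target Q-values
        q1 = critic1_target(next_state, next_action)
        q2 = critic2_target(next_state, next_action)
        target_q = reward + gamma * (1.0 - done) * torch.min(q1, q2)
    return target_q
```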
Key features of this implementation:
- Actor-Critic architecture with twin critics
- Delayed policy updates
- Target policy smoothing regularization
- Experience replay buffer for stable learning
- Soft target network updates using Polyak averaging
- Exploration using additive noise
- Support for different continuous control environments
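The soft target updates via Polyak averaging amount to a small blending step; a minimal sketch (function and parameter names are illustrative, not taken from the repository):

```python
def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    for target_param, param in zip(target_net.parameters(), online_net.parameters()):
        target_param.data.copy_(tau * param.data + (1.0 - tau) * target_param.data)
```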
Environments
This implementation includes support for the following environments:
- Pendulum-v1: A classic control problem where the goal is to balance a pendulum in an upright position.
- BipedalWalker-v3: A more challenging environment where a 2D biped robot must walk forward without falling.
- HalfCheetah-v5: A MuJoCo environment where a 2D cheetah-like robot must run forward as fast as possible.
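For reference, any of these IDs can be instantiated with Gymnasium (assuming `gymnasium` plus the Box2D/MuJoCo extras are installed); this snippet is illustrative and not part of the repository:

```python
import gymnasium as gym

# Any of the supported environment IDs can be used as env_id
env = gym.make("BipedalWalker-v3", render_mode="rgb_array")
obs, info = env.reset(seed=1)
print(env.observation_space.shape, env.action_space.shape)
```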
Configuration
Each implementation includes a `Config` class that specifies the hyperparameters for training. You can modify these parameters to experiment with different settings:
- `exp_name`: Name of the experiment
- `seed`: Random seed for reproducibility
- `env_id`: ID of the Gymnasium environment
- `policy_noise`: Standard deviation of noise added to the target policy
- `total_timesteps`: Total number of training steps
- `learning_rate`: Learning rate for the optimizer
- `buffer_size`: Size of the replay buffer
- `gamma`: Discount factor
- `tau`: Soft update coefficient for target networks
- `batch_size`: Batch size for training
- `clip`: Clipping range for target policy smoothing noise
- `exploration_fraction`: Fraction of total timesteps used for exploration
- `learning_starts`: Number of timesteps before learning starts
- `train_frequency`: Frequency of updates to the networks
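A minimal sketch of what such a `Config` class might look like; the default values below are illustrative guesses, not the repository's actual settings:

```python
from dataclasses import dataclass

@dataclass
class Config:
    # All defaults are placeholder values for illustration only
    exp_name: str = "td3"
    seed: int = 1
    env_id: str = "BipedalWalker-v3"
    policy_noise: float = 0.2
    total_timesteps: int = 1_000_000
    learning_rate: float = 3e-4
    buffer_size: int = 1_000_000
    gamma: float = 0.99
    tau: float = 0.005
    batch_size: int = 256
    clip: float = 0.5
    exploration_fraction: float = 0.1
    learning_starts: int = 25_000
    train_frequency: int = 1
```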
Architecture
The TD3 implementation includes:
- Actor Network: Determines the best action in a given state
- Twin Critic Networks: Two separate networks that evaluate the Q-value of state-action pairs
- Target Networks: Slowly updated copies of both actor and critics for stability
- Replay Buffer: Stores and samples transitions for training
- Noise Process: Adds exploration noise to actions during training
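As a rough illustration of the first two components, the actor and critics are typically small MLPs; the layer widths and activations below are assumptions, not the repository's exact architecture:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: maps a state to an action in [-max_action, max_action]."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Q-network: maps a (state, action) pair to a scalar value; TD3 keeps two of these."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```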
Improvements Over DDPG
TD3 addresses several shortcomings of DDPG:
- Reducing Overestimation Bias: By using the minimum of two critics, TD3 helps mitigate the overestimation bias that plagues many Q-learning algorithms.
- Stabilized Learning: Delayed policy updates (updating the policy less frequently than the critics) help reduce variance and stabilize learning.
- Smoother Target Values: Adding noise to target actions smooths the value function, making the learning process more robust to errors.
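Building on the `compute_td3_target` and `soft_update` sketches above, a single update step with the delayed schedule might look roughly like this; the `policy_frequency` of 2 and all helper and argument names are illustrative assumptions, not the repository's code:

```python
import torch
import torch.nn.functional as F

def td3_update_step(global_step, batch, actor, actor_target,
                    critic1, critic2, critic1_target, critic2_target,
                    actor_opt, critic_opt, cfg, policy_frequency=2):
    """One TD3 gradient step: critics update every call, actor and targets only
    every `policy_frequency` calls (the delayed policy update)."""
    state, action, reward, next_state, done = batch

    # Critic update: regress both critics toward the shared clipped double-Q target
    target_q = compute_td3_target(reward, next_state, done, actor_target,
                                  critic1_target, critic2_target,
                                  gamma=cfg.gamma, policy_noise=cfg.policy_noise,
                                  noise_clip=cfg.clip)
    critic_loss = F.mse_loss(critic1(state, action), target_q) + \
                  F.mse_loss(critic2(state, action), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed policy update: ascend Q1 of the online critic, then soft-update targets
    if global_step % policy_frequency == 0:
        actor_loss = -critic1(state, actor(state)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        soft_update(actor_target, actor, cfg.tau)
        soft_update(critic1_target, critic1, cfg.tau)
        soft_update(critic2_target, critic2, cfg.tau)
```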
Results
The implementation includes a video recording (`TD3_BipedalWalker.mp4`) that demonstrates the performance of the trained TD3 agent on the BipedalWalker environment.
Training Visualizations
BipedalWalker Agent
Here’s a GIF showing the trained TD3 agent navigating the BipedalWalker environment:
HalfCheetah Training
The following graph shows the training losses for the HalfCheetah environment:
Acknowledgements
- CleanRL - Inspiration for code structure and implementation style
Source Code
📁 GitHub Repository: TD3
View the complete implementation, training scripts, and documentation on GitHub.