# VizDoom RL
## Overview
A DQN agent trained on the ViZDoom Basic scenario (`VizdoomBasic-v0`) via ViZDoom's Gymnasium wrapper. The agent learns to navigate a 3D first-person environment and eliminate the target monster using raw pixel observations.
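A minimal sketch of how the environment is created, assuming the `vizdoom` package is installed (importing `vizdoom.gymnasium_wrapper` is what registers the Gymnasium env IDs):

```python
import gymnasium as gym
from vizdoom import gymnasium_wrapper  # noqa: F401 -- registers VizdoomBasic-v0 and friends

env = gym.make("VizdoomBasic-v0")
obs, info = env.reset()   # obs is a dict; obs["screen"] holds the RGB frame
print(env.action_space)   # discrete action space for the Basic scenario
env.close()
```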
## Architecture
Three convolutional layers, two fully connected layers, and a linear output head:
| Layer | Spec |
|---|---|
| Conv 1 | 32 filters, 8×8, stride 4, ReLU |
| Conv 2 | 32 filters, 4×4, stride 2, ReLU |
| Conv 3 | 64 filters, 3×3, stride 3, ReLU |
| FC 1 | 512 units, ReLU |
| FC 2 | 512 units, ReLU |
| Output | Action space dim |
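A minimal PyTorch sketch of this network; the class name `QNetwork` is illustrative (not taken from the repo), and the input is assumed to be a single 1×128×128 grayscale frame as described under Image Preprocessing below:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """CNN from the table above; input is a 1x128x128 grayscale frame."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=3), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened feature size with a dummy forward pass
        with torch.no_grad():
            n_flat = self.conv(torch.zeros(1, 1, 128, 128)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(n_flat, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.conv(x))
```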
## Image Preprocessing
- RGB → grayscale, channel-first
- Resize to 128×128
- Normalize to [0, 1]
- Handles dict observations (`obs['screen']`)
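A sketch of this pipeline, assuming OpenCV (`cv2`) for the conversions; the function name `preprocess` is illustrative:

```python
import cv2
import numpy as np

def preprocess(obs) -> np.ndarray:
    """RGB HWC uint8 frame -> grayscale CHW float32 in [0, 1], resized to 128x128."""
    frame = obs["screen"] if isinstance(obs, dict) else obs  # unwrap dict observations
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    gray = cv2.resize(gray, (128, 128), interpolation=cv2.INTER_AREA)
    return gray[None, :, :].astype(np.float32) / 255.0      # add channel dim, normalize
```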
## Training Config
| Hyperparameter | Value |
|---|---|
| Total timesteps | 1,000,000 |
| Learning rate | 2e-4 |
| Buffer size | 30,000 |
| Batch size | 128 |
| Gamma | 0.99 |
| Epsilon start/end | 1.0 → 0.05 |
| Exploration fraction | 0.5 |
| Target update freq | 50 steps |
| Optimizer | Adam |
| Loss | MSE |
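To show how these values fit together, here is an illustrative sketch of the linear epsilon schedule (annealed over the first 50% of training per the exploration fraction) and a single DQN gradient step with the gamma/Adam/MSE settings above. Function names and the batch layout are assumptions, not the repo's API:

```python
import torch
import torch.nn.functional as F

def linear_epsilon(step: int, total_timesteps: int = 1_000_000,
                   eps_start: float = 1.0, eps_end: float = 0.05,
                   exploration_fraction: float = 0.5) -> float:
    """Anneal epsilon linearly from 1.0 to 0.05 over the first 50% of training."""
    progress = min(1.0, step / (exploration_fraction * total_timesteps))
    return eps_start + progress * (eps_end - eps_start)

def dqn_update(online, target, optimizer, batch, gamma: float = 0.99) -> float:
    """One gradient step on the MSE TD loss (Adam optimizer created elsewhere)."""
    obs, actions, rewards, next_obs, dones = batch  # assumed tensor layout
    with torch.no_grad():
        max_next_q = target(next_obs).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * max_next_q
    q = online(obs).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```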
## Features
- Target network with Polyak averaging (`tau`); see the sketch after this list
- SB3 replay buffer for efficient sampling
- W&B logging (episodic return, Q-values, epsilon)
- Periodic evaluation with video export
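A minimal sketch of the Polyak (soft) target update mentioned above; the source does not state the `tau` value, so it is left as a parameter:

```python
import torch

@torch.no_grad()
def polyak_update(online: torch.nn.Module, target: torch.nn.Module, tau: float) -> None:
    """Soft update: theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for p_target, p_online in zip(target.parameters(), online.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_online)
```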
## References
Mnih et al., *Playing Atari with Deep Reinforcement Learning*, DeepMind, 2013. arXiv:1312.5602