DPO
Overview
A from-scratch implementation of Direct Preference Optimization (DPO).
Technical Details
- Framework: PyTorch
- Dataset: UltraFeedback
- Category: Fine-tuning
Implementation Details
I trained Qwen0.5B-Instruct using Direct Preference Optimization (DPO) in PyTorch.
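At the heart of any from-scratch DPO implementation is the preference loss itself. The snippet below is a minimal sketch of that loss in PyTorch; the function name, argument names, and the choice of beta = 0.1 are illustrative assumptions, not the repository's exact code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio)).

    Each argument is a tensor of summed per-token log-probabilities of the chosen or
    rejected completion under the trainable policy or the frozen reference model.
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_logratio - ref_logratio)
    loss = -F.logsigmoid(logits).mean()

    # Implicit rewards, handy for logging the chosen/rejected margin during training.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps).detach()
    return loss, chosen_rewards, rejected_rewards
```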
ModelArgs Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| batch_size | 2 | The number of samples processed before the model is updated. |
| max_lr | 1e-6 | Maximum learning rate. |
| device | 'cuda:0' | The device to run the model on (e.g., 'cuda:0' for GPU). |
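These hyperparameters map naturally onto a small config object. Below is a possible sketch of such a ModelArgs dataclass; the values come from the table above, but the structure and field names are hypothetical rather than the repository's actual code.

```python
from dataclasses import dataclass

@dataclass
class ModelArgs:
    # Values from the hyperparameter table above; field names are illustrative.
    batch_size: int = 2      # samples processed per model update
    max_lr: float = 1e-6     # maximum learning rate
    device: str = "cuda:0"   # device to run the model on
```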
Datasets
UltraFeedback
Frameworks:
PyTorch
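Loading the preference data might look like the sketch below, assuming the binarized UltraFeedback release on the Hugging Face Hub (HuggingFaceH4/ultrafeedback_binarized); the repository may use a different variant or its own preprocessing.

```python
from datasets import load_dataset

# Assumed dataset ID and split; each row pairs a prompt with a preferred
# ("chosen") and a dispreferred ("rejected") response in chat format.
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

example = dataset[0]
prompt = example["prompt"]      # user instruction
chosen = example["chosen"]      # preferred response (list of chat messages)
rejected = example["rejected"]  # dispreferred response (list of chat messages)
```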
Epochs/Steps
Training iterations: 3000
Validation: every 20 iterations
Losses
Train loss: 0.67
Val loss: 0.68
For reference, the standard DPO loss starts at ln 2 ≈ 0.693 while the policy still matches the reference model, so these values correspond to a small but non-zero preference margin after training.
Source Code
📁 GitHub Repository: DPO
View the complete implementation, training scripts, and documentation on GitHub.