DeepSeekV3

Category: Language Models
Framework: PyTorch
Dataset: TinyStories
Created: August 06, 2025

Overview

A from-scratch implementation of the DeepSeekV3 architecture.

Key Features

  • Mixture of Experts (MoE)
  • Attention Mechanism
  • Transformer Architecture
  • Memory Optimization

Technical Details

  • Framework: PyTorch
  • Dataset: TinyStories
  • Category: Language Models

Implementation Details

This repository contains a PyTorch implementation of the DeepSeekV3 architecture, trained on the TinyStories dataset. The model targets efficient text generation and understanding, leveraging a mixture-of-experts (MoE) design.

  • I trained a DeepSeekV3 (16 experts, top-4 routing) architecture that I coded from the ground up; a minimal sketch of the expert routing is shown after this list.
  • Trained on the TinyStories dataset from HuggingFace, which consists of 4.2B tokens, for a few steps with gradient accumulation amounting to roughly 300M tokens.
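
For context, here is a minimal sketch of 16-expert, top-4 routing in PyTorch. All class and variable names are illustrative and are not taken from this repository's train.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of top-k expert routing (16 experts, top-4 per token).
# Names are illustrative, not taken from this repository.
class SimpleMoE(nn.Module):
    def __init__(self, dim=512, num_experts=16, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                          # x: (batch, seq, dim)
        scores = self.router(x)                    # (batch, seq, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)   # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[..., slot]              # (batch, seq) expert id per token
            w = weights[..., slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = (idx == e)
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out
```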

Pretraining

Dataset

1) Train dataset: ~2M records
2) Validation dataset: ~26K records
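
These split sizes roughly match the public TinyStories dataset on HuggingFace. A minimal loading sketch, assuming the `datasets` library and the `roneneldan/TinyStories` dataset id (an assumption, not confirmed by the repo):

```python
from datasets import load_dataset

# Illustrative: load TinyStories from HuggingFace. The dataset id and the way
# train.py builds its splits may differ from what is shown here.
dataset = load_dataset("roneneldan/TinyStories")   # assumed dataset id
train_ds = dataset["train"]                        # ~2M records per the numbers above
val_ds = dataset["validation"]                     # ~26K records per the numbers above
print(f"train: {len(train_ds)}, val: {len(val_ds)}")
```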


Model Configuration (ModelArgs)

This dataclass defines hyperparameters and configuration settings for the DeepSeekV3 model, as defined in train.py.

Hyperparameters Overview

Architecture

| Parameter | Value | Description |
|-----------|-------|-------------|
| block_size | 256 | Context window length for sequential data |
| embeddings_dims | 512 | Dimension size for embeddings |
| no_of_heads | 8 | Number of attention heads in multi-head attention |
| no_of_decoder_layers | 8 | Number of transformer decoder layers |
| vocab_size | len(tokenizer.get_vocab()) | Vocabulary size from the tokenizer |
| base_freq | 10000 | Base frequency for positional encodings |
| latent_dim | 64 | Latent dimension for attention |
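
The base_freq value of 10000 is the standard rotary positional embedding (RoPE) base. Assuming a RoPE-style positional encoding (an assumption; the table alone does not confirm it), the per-dimension frequencies could be derived like this:

```python
import torch

# Sketch: per-dimension RoPE frequencies from base_freq = 10000.
# head_dim = embeddings_dims // no_of_heads = 512 // 8 = 64.
base_freq = 10000
head_dim = 512 // 8
inv_freq = 1.0 / (base_freq ** (torch.arange(0, head_dim, 2).float() / head_dim))

positions = torch.arange(256)                         # block_size = 256
angles = torch.outer(positions.float(), inv_freq)     # (256, head_dim // 2)
cos, sin = angles.cos(), angles.sin()                 # used to rotate query/key pairs
```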

Training

| Parameter | Value | Description |
|-----------|-------|-------------|
| epochs | 1 | Total training epochs |
| batch_size | 32 | Samples per batch |
| max_lr | 6e-4 | Maximum learning rate |
| clip | 1.0 | Gradient clipping threshold |
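
Put together with the gradient accumulation mentioned above, one accumulated training step using clip = 1.0 might look like the sketch below. The number of accumulation steps and the exact loop structure are placeholders, not taken from train.py:

```python
import torch

# Sketch of one accumulated training step; accum_steps is a placeholder value.
accum_steps = 8

def train_step(model, optimizer, scheduler, batches, device="cuda"):
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for micro_batch in batches:                      # len(batches) == accum_steps
        input_ids, targets = (t.to(device) for t in micro_batch)
        logits = model(input_ids)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1)
        )
        (loss / accum_steps).backward()              # average over micro-batches
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip = 1.0
    optimizer.step()
    scheduler.step()                                 # schedule toward/away from max_lr
```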

Regularization

| Parameter | Value | Description |
|-----------|-------|-------------|
| attn_dropout | 0.1 | Dropout probability for attention layers |
| dropout | 0.1 | General dropout probability |

Optimization

| Parameter | Value | Description |
|-----------|-------|-------------|
| weight_decay_optim | 0.1 | Weight decay strength (AdamW) |
| beta_1 | 0.9 | AdamW first momentum factor |
| beta_2 | 0.95 | AdamW second momentum factor |
| eps | 1e-8 | Epsilon for numerical stability |
| loss_scale | 0.3 | Loss scaling factor |
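
These values map directly onto a standard PyTorch AdamW setup. The learning-rate schedule shown here (cosine) and its T_max are assumptions; only the optimizer hyperparameters come from the table:

```python
import torch

# Sketch: AdamW configured with the optimization values above.
# The scheduler choice and T_max are assumptions, not taken from train.py.
def build_optimizer(model: torch.nn.Module):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=6e-4,               # max_lr
        betas=(0.9, 0.95),     # beta_1, beta_2
        eps=1e-8,
        weight_decay=0.1,      # weight_decay_optim
    )
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
    return optimizer, scheduler
```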

Mixture-of-Experts (MoE)

| Parameter | Value | Description |
|-----------|-------|-------------|
| experts | 16 | Total number of experts in the MoE layer |
| top_experts | 4 | Number of active experts per token |
| noisy_topk | False | Enable noisy top-k expert selection |
| use_shared_expert | True | Enable/disable the shared expert |
| useauxFreeLoadBalancingLoss | True | Use auxiliary-loss-free load balancing |
| aux_free_bias_update_rate | 0.001 | Update rate for the auxiliary-loss-free routing bias |
| mtp_heads | 1 | Number of multi-token prediction heads |
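
For reference, auxiliary-loss-free load balancing (as introduced in the DeepSeek-V3 paper) keeps a per-expert bias that is added to the routing scores only for top-k selection and is nudged after each step according to expert load. A minimal sketch with illustrative names, which may differ from this repository's implementation:

```python
import torch

# Sketch of auxiliary-loss-free load balancing: the bias steers expert selection
# but does not enter the gating weights; it is updated from per-step expert load.
num_experts, top_k, update_rate = 16, 4, 0.001
expert_bias = torch.zeros(num_experts)

def select_experts(router_scores):                 # (tokens, num_experts)
    biased = router_scores + expert_bias           # bias affects selection only
    topk_idx = biased.topk(top_k, dim=-1).indices
    # gating weights come from the unbiased scores of the selected experts
    gates = torch.softmax(router_scores.gather(-1, topk_idx), dim=-1)
    return topk_idx, gates

def update_bias(topk_idx):
    # count how many tokens were routed to each expert in this step
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    mean_load = load.mean()
    # overloaded experts get their bias decreased, underloaded experts increased
    expert_bias.add_(update_rate * torch.sign(mean_load - load))
```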

Hardware & Optimization

| Parameter | Value | Description |
|-----------|-------|-------------|
| device | 'cuda:8' | Training accelerator (GPU/CPU) |
| use_checkpointing | False | Enable gradient checkpointing |
| use_liger | False | Use Liger kernels for optimized operations |
| ignore_pad_token_in_loss | True | Ignore padding tokens in the loss calculation |

  • Trained on a P100 GPU on Kaggle
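
Taken together, the tables above could map onto a dataclass along the following lines. Field names and defaults follow the tables; the grouping comments are illustrative, and the actual definition in train.py may differ:

```python
from dataclasses import dataclass

# Sketch of how the hyperparameter tables could map onto a ModelArgs dataclass.
@dataclass
class ModelArgs:
    # architecture
    block_size: int = 256
    embeddings_dims: int = 512
    no_of_heads: int = 8
    no_of_decoder_layers: int = 8
    vocab_size: int = 0            # resolved as len(tokenizer.get_vocab()) at runtime
    base_freq: int = 10000
    latent_dim: int = 64
    # training
    epochs: int = 1
    batch_size: int = 32
    max_lr: float = 6e-4
    clip: float = 1.0
    # regularization
    attn_dropout: float = 0.1
    dropout: float = 0.1
    # optimization
    weight_decay_optim: float = 0.1
    beta_1: float = 0.9
    beta_2: float = 0.95
    eps: float = 1e-8
    loss_scale: float = 0.3
    # mixture of experts
    experts: int = 16
    top_experts: int = 4
    noisy_topk: bool = False
    use_shared_expert: bool = True
    useauxFreeLoadBalancingLoss: bool = True
    aux_free_bias_update_rate: float = 0.001
    mtp_heads: int = 1
    # hardware & optimization
    device: str = "cuda:8"
    use_checkpointing: bool = False
    use_liger: bool = False
    ignore_pad_token_in_loss: bool = True
```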

Frameworks:

PyTorch


Source Code

📁 GitHub Repository: DeepSeekV3

View the complete implementation, training scripts, and documentation on GitHub.