Gemma3
Overview
From-scratch implementation of Gemma 3.
Technical Details
- Framework: PyTorch
- Dataset: Custom
- Category: Language Models
Implementation Details
Trained a small Gemma 3 model (~90M parameters), coded and trained from scratch in PyTorch (text only).
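Gemma 3's decoder blocks combine grouped-query attention with local (sliding-window) attention. The sketch below is a hypothetical, minimal version of such a layer, using the hyperparameter names from the table that follows; it is not the repository's actual module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalGQAttention(nn.Module):
    """Grouped-query attention with a causal sliding-window mask (illustrative sketch)."""

    def __init__(self, embeddings_dims=512, no_of_heads=8, no_kv_heads=2,
                 local_block_size=128, attn_dropout=0.1):
        super().__init__()
        assert embeddings_dims % no_of_heads == 0
        self.n_heads = no_of_heads
        self.n_kv_heads = no_kv_heads
        self.head_dim = embeddings_dims // no_of_heads
        self.window = local_block_size
        # Queries get all heads; keys/values get fewer heads that are shared (GQA).
        self.q_proj = nn.Linear(embeddings_dims, no_of_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(embeddings_dims, no_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(embeddings_dims, no_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(no_of_heads * self.head_dim, embeddings_dims, bias=False)
        self.dropout = attn_dropout

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat K/V heads so each group of query heads shares one K/V head.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        # Causal mask restricted to a sliding window of `local_block_size` tokens.
        idx = torch.arange(T, device=x.device)
        mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < self.window)
        out = F.scaled_dot_product_attention(
            q, k, v, attn_mask=mask,
            dropout_p=self.dropout if self.training else 0.0,
        )
        out = out.transpose(1, 2).contiguous().view(B, T, -1)
        return self.o_proj(out)
```

With no_of_heads = 8 and no_kv_heads = 2, each key/value head is shared by four query heads, which shrinks the KV projections (and KV cache) by 4x relative to standard multi-head attention.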
ModelArgs Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| batch_size | 64 | Number of samples processed before model update |
| max_lr | 2.5e-4 | Maximum learning rate |
| dropout | 0.1 | Dropout rate for regularization |
| block_size | 256 | Sequence length (number of tokens) |
| vocab_size | 32000 + 768 | Vocabulary size |
| embeddings_dims | 512 | Token embedding dimensionality |
| attn_dropout | 0.1 | Dropout rate for attention layers |
| no_of_heads | 8 | Number of attention heads in multi-head attention |
| no_of_decoder_layers | 6 | Number of decoder layers |
| weight_decay_optim | 0.1 | Optimizer weight decay |
| beta_1 | 0.9 | Adam optimizer beta1 parameter |
| beta_2 | 0.95 | Adam optimizer beta2 parameter |
| no_kv_heads | 2 | Number of key/value heads |
| scaling_factor | 0.5 | Scaling factor for certain operations |
| local_block_size | 128 | Local attention block size |
| base_freq | 10000 | Base frequency |
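These settings map naturally onto a small config object. The sketch below is a hypothetical dataclass mirroring the table; field names and values are copied from the table, not from the repository.

```python
from dataclasses import dataclass

@dataclass
class ModelArgs:
    # Values taken from the hyperparameter table above (illustrative, not the repo's file).
    batch_size: int = 64
    max_lr: float = 2.5e-4
    dropout: float = 0.1
    block_size: int = 256
    vocab_size: int = 32000 + 768
    embeddings_dims: int = 512
    attn_dropout: float = 0.1
    no_of_heads: int = 8
    no_of_decoder_layers: int = 6
    weight_decay_optim: float = 0.1
    beta_1: float = 0.9
    beta_2: float = 0.95
    no_kv_heads: int = 2
    scaling_factor: float = 0.5
    local_block_size: int = 128
    base_freq: int = 10000
```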
Dataset
Custom (see Technical Details above).
Frameworks:
PyTorch
Epochs/Steps
Training steps = 25,000
Validation = every 500 steps
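A minimal sketch of how the optimizer settings above could be wired up in PyTorch; the placeholder model, the cosine schedule, and the loop skeleton are assumptions, not the repository's exact training code.

```python
import torch
import torch.nn as nn

max_lr, total_steps, eval_every = 2.5e-4, 25_000, 500
model = nn.Linear(512, 512)  # placeholder standing in for the ~90M-parameter decoder

# AdamW with beta_1/beta_2 and weight_decay_optim from the hyperparameter table
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=max_lr,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
# Cosine decay over all training steps (an assumption; the actual schedule may differ)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    x = torch.randn(64, 512)        # batch_size from the table; dummy data
    loss = model(x).pow(2).mean()   # dummy loss standing in for cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
    if step % eval_every == 0:
        pass  # run validation here (every 500 steps, per the settings above)
```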
Loss Curves
Train loss: 2.08 (last step)
Val loss: 1.77
Source Code
📁 GitHub Repository: Gemma3
View the complete implementation, training scripts, and documentation on GitHub.