Smol Transformer
Project Overview
A compact implementation of an Encoder-Decoder Transformer for sequence-to-sequence translation tasks. This project implements a translation model from English to Hindi using the Samanantar dataset.
Technical Details
- Type: SmolHub Playground Project
- Framework: PyTorch
- Category: Experimental AI/ML
- Repository: Smol Transformer
Features
- Encoder-Decoder Architecture: Full transformer implementation with separate encoder and decoder
- Sinusoidal Positional Embeddings: Fixed sine/cosine position encodings added to token embeddings for sequence-order information
- Multi-Head Attention: Self-attention and cross-attention mechanisms
- Advanced Generation: Top-K sampling and beam search for text generation
- Mixed Precision Training: Automatic mixed precision for faster training
- Gradient Accumulation: Support for large effective batch sizes
- Comprehensive Logging: WandB integration for experiment tracking
Architecture
Model Components
- Encoder:
- Multi-layer encoder blocks with self-attention
- Sinusoidal positional embeddings
- Layer normalization and feed-forward networks
- Decoder:
- Multi-layer decoder blocks with masked self-attention
- Cross-attention to encoder outputs
- Autoregressive generation capability
- Attention Mechanisms:
- Masked Multi-Head Attention (for decoder self-attention)
- Full Multi-Head Attention (for encoder self-attention)
- Cross Multi-Head Attention (for encoder-decoder attention)
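For reference, here is a minimal sketch of the fixed sinusoidal positional encoding described above; the actual module in model.py may organize this differently:

```python
import torch

def sinusoidal_positional_encoding(block_size: int, embeddings_dims: int) -> torch.Tensor:
    """Fixed sine/cosine position encodings (Vaswani et al., 2017), added to token embeddings."""
    position = torch.arange(block_size, dtype=torch.float32).unsqueeze(1)          # (T, 1)
    div_term = torch.exp(
        torch.arange(0, embeddings_dims, 2, dtype=torch.float32)
        * (-torch.log(torch.tensor(10000.0)) / embeddings_dims)
    )                                                                               # (D/2,)
    pe = torch.zeros(block_size, embeddings_dims)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe
```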
Installation
# Clone the repository
cd SmolTransformer
# Install dependencies
chmod +x install.sh
./install.sh
Configuration
The model configuration can be modified in config.py:
@dataclass
class ModelArgs:
    block_size: int = 512           # Maximum sequence length
    batch_size: int = 32            # Training batch size
    embeddings_dims: int = 512      # Model embedding dimensions
    no_of_heads: int = 8            # Number of attention heads
    no_of_decoder_layers: int = 6   # Number of decoder layers
    max_lr: float = 6e-4            # Maximum learning rate
    # ... additional parameters
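With the dataclass above, a configuration can be instantiated and overridden directly, for example (the override values below are purely illustrative):

```python
from config import ModelArgs

# Smaller configuration for a quick local run
args = ModelArgs(block_size=128, batch_size=8, embeddings_dims=256, no_of_heads=4)
print(args)
```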
Usage
Training
python trainer.py
Web Application
Launch the interactive Gradio web interface:
python launch_app.py
The app will be available at http://localhost:7860 and provides:
- Interactive Translation: English to Hindi translation interface
- Multiple Generation Methods: Top-K sampling and beam search
- Real-time Parameter Adjustment: Temperature, top-k, beam width controls
- Model Loading: Load trained checkpoints
- Example Translations: Pre-built examples to try
Web App Features
- User-friendly Interface: Clean, intuitive web UI
- Configurable Parameters: Adjust generation settings in real-time
- Model Information: Display architecture and training details
- Multiple Methods: Compare Top-K sampling vs Beam search
- Checkpoint Loading: Load your trained models
- Examples: Built-in examples to test the model
Key Training Features
- Gradient Accumulation: Automatically calculated based on total_batch_size (see the sketch after this list)
- Learning Rate Scheduling: Warmup + cosine decay
- Mixed Precision: Automatic mixed precision training
- Checkpointing: Regular model checkpoints saved to checkpoints/
- Generation Samples: Periodic text generation during training
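As a rough sketch, these pieces typically combine in a training step like the one below; the real loop lives in trainer.py, and the function signature here is an assumption:

```python
import torch

def train_epoch(model, optimizer, train_loader, total_batch_size, batch_size):
    """Gradient accumulation + automatic mixed precision (illustrative skeleton)."""
    scaler = torch.cuda.amp.GradScaler()
    grad_accum_steps = total_batch_size // batch_size        # effective batch = total_batch_size
    optimizer.zero_grad(set_to_none=True)
    for step, (src, tgt) in enumerate(train_loader):
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model(src, tgt)                            # assumes the model returns the loss
        scaler.scale(loss / grad_accum_steps).backward()      # scale loss for accumulation
        if (step + 1) % grad_accum_steps == 0:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # 1.0 max-norm clipping
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
```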
Dataset
The model is trained on the Hindi-English Samanantar dataset:
- Source: English text
- Target: Hindi text
- Preprocessing: Automatic tokenization with IndicBARTSS tokenizer
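A minimal sketch of how the dataset and tokenizer might be loaded with the Hugging Face libraries; the dataset id, config name, column names, and tokenizer options are assumptions, and the project's own logic lives in data.py and tokenizer.py:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Dataset id/config and column names are assumptions; adjust to match data.py
dataset = load_dataset("ai4bharat/samanantar", "hi", split="train")
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/IndicBARTSS", use_fast=False)

example = dataset[0]
src_ids = tokenizer(example["src"], truncation=True, max_length=512)["input_ids"]  # English side
tgt_ids = tokenizer(example["tgt"], truncation=True, max_length=512)["input_ids"]  # Hindi side
```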
File Structure
SmolTransformer/
├── config.py          # Model configuration and hyperparameters
├── model.py           # Transformer model implementation
├── data.py            # Dataset loading and preprocessing
├── tokenizer.py       # Tokenizer setup and utilities
├── trainer.py         # Training loop and utilities
├── inference.py       # Text generation functions
├── install.sh         # Installation script
├── README.md          # This file
├── checkpoints/       # Model checkpoints
├── generated_data/    # Generated text samples
├── gradio/            # Gradio interface (optional)
└── old/               # Backup files
Model Parameters
- Parameters: ~25M (configurable)
- Context Length: 512 tokens
- Vocabulary: IndicBARTSS tokenizer (~30K tokens)
- Architecture: 6-layer encoder-decoder
Training Features
Optimization
- Optimizer: AdamW with weight decay
- Learning Rate: 6e-4 with warmup and cosine decay
- Gradient Clipping: 1.0 max norm
- Mixed Precision: Automatic FP16 training
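A small sketch of a warmup-plus-cosine schedule consistent with the settings above; the step counts and minimum learning rate are placeholders:

```python
import math

def get_lr(step: int, warmup_steps: int, max_steps: int,
           max_lr: float = 6e-4, min_lr: float = 6e-5) -> float:
    """Linear warmup followed by cosine decay to min_lr (illustrative)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps            # linear warmup
    if step >= max_steps:
        return min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))    # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)
```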
Monitoring
- WandB Integration: Comprehensive experiment tracking
- Metrics: Loss, perplexity, gradient norms
- Generation Samples: Regular text generation examples
- Validation: Periodic validation loss evaluation
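Logging with WandB typically amounts to an init call plus per-step scalar logging, roughly as below (the project name and metric values are placeholders):

```python
import math
import wandb

wandb.init(project="SmolTransformer", name="baseline-run")        # names are illustrative

# Inside the training loop: log scalars each step
loss_value, grad_norm, current_lr, step = 2.31, 0.87, 3e-4, 100   # placeholder values
wandb.log({
    "train/loss": loss_value,
    "train/perplexity": math.exp(loss_value),
    "train/grad_norm": grad_norm,
    "lr": current_lr,
}, step=step)
```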
Generation Methods
- Top-K Sampling: Configurable top-k and temperature
- Beam Search: Multi-beam search with configurable width
- Repetition Penalty: Reduces repetitive generation
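To illustrate the idea behind the Top-K path, here is a single sampling step in isolation; the project's topk_sampling in inference.py additionally runs the autoregressive loop and applies the repetition penalty:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, top_k: int = 50, temperature: float = 1.0) -> int:
    """One step of top-k sampling over a (vocab_size,) logits vector."""
    logits = logits / max(temperature, 1e-8)              # temperature scaling
    values, indices = torch.topk(logits, top_k)           # keep the k most likely tokens
    probs = F.softmax(values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)      # sample within the top-k set
    return indices[choice].item()
```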
Example Usage
from model import Transformer
from tokenizer import initialize_tokenizer
from inference import topk_sampling, beam_search_corrected
# Initialize model and tokenizer
tokenizer = initialize_tokenizer()
model = Transformer(src_vocab_size=len(tokenizer), tgt_vocab_size=len(tokenizer))
# Generate text
prompt = "Hello, how are you?"
generated = topk_sampling(model, prompt, tokenizer, device="cuda", max_length=50)
print(generated)
Customization
Adding New Datasets
Modify data.py to load your dataset:
def load_datasets(token, sample_size=None):
    # Load your custom dataset here
    dataset = load_dataset("your_dataset")
    return dataset
Changing Model Size
Adjust parameters in config.py:
embeddings_dims = 768        # Larger model
no_of_heads = 12             # More attention heads
no_of_decoder_layers = 12    # Deeper model
Requirements
- Python 3.8+
- PyTorch 2.0+
- Transformers
- Datasets
- WandB
- CUDA-capable GPU (recommended)
Model Resources
- Hugging Face Model: YuvrajSingh9886/SmolTransformer
- Training Report: Weights & Biases Report
License
This project is open source and available under the MIT License.
Contributing
Feel free to submit issues, fork the repository, and create pull requests for any improvements.
Source Code
GitHub Repository: Smol Transformer
View the complete implementation, documentation, and examples on GitHub.
Interactive Features
Web Interface: This project includes a Gradio-based web interface for easy interaction and experimentation.
User-Friendly: Simple, intuitive interface perfect for testing and learning.
This project is part of the SmolHub Playground collection - a space for experimental AI models and proof-of-concept implementations.