Smol Transformer

Try it on Hugging Face

Project Overview

A compact implementation of an Encoder-Decoder Transformer for sequence-to-sequence translation tasks. This project implements a translation model from English to Hindi using the Samanantar dataset.

Technical Details

  • Type: SmolHub Playground Project
  • Framework: PyTorch
  • Category: Experimental AI/ML
  • Repository: Smol Transformer

Features

  • Encoder-Decoder Architecture: Full transformer implementation with separate encoder and decoder
  • Sinusoidal Positional Embeddings: Sinusoidal position encodings added to the token embeddings so the model can use sequence order (see the sketch after this list)
  • Multi-Head Attention: Self-attention and cross-attention mechanisms
  • Advanced Generation: Top-K sampling and beam search for text generation
  • Mixed Precision Training: Automatic mixed precision for faster training
  • Gradient Accumulation: Support for large effective batch sizes
  • Comprehensive Logging: WandB integration for experiment tracking
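
For reference, here is a minimal sketch of the standard fixed sinusoidal position encoding (the "Attention Is All You Need" formulation); the function name and shapes are illustrative and not taken from model.py:

import math
import torch

def sinusoidal_positional_encoding(block_size: int, embeddings_dims: int) -> torch.Tensor:
    """Return a (block_size, embeddings_dims) matrix of fixed sinusoidal encodings."""
    position = torch.arange(block_size, dtype=torch.float32).unsqueeze(1)          # (T, 1)
    div_term = torch.exp(
        torch.arange(0, embeddings_dims, 2, dtype=torch.float32)
        * (-math.log(10000.0) / embeddings_dims)
    )                                                                               # (D/2,)
    pe = torch.zeros(block_size, embeddings_dims)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# Added to the token embeddings before the first encoder/decoder block, e.g.:
# x = token_embeddings + sinusoidal_positional_encoding(T, D).to(x.device)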

Architecture

Model Components

  1. Encoder:
    • Multi-layer encoder blocks with self-attention
    • Sinusoidal positional embeddings
    • Layer normalization and feed-forward networks
  2. Decoder:
    • Multi-layer decoder blocks with masked self-attention
    • Cross-attention to encoder outputs
    • Autoregressive generation capability
  3. Attention Mechanisms (see the sketch after this list):
    • Masked Multi-Head Attention (for decoder self-attention)
    • Full Multi-Head Attention (for encoder self-attention)
    • Cross Multi-Head Attention (for encoder-decoder attention)
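
To make the three attention roles concrete, here is a hedged sketch of one decoder block built on torch.nn.MultiheadAttention; the class, layer, and argument names are illustrative, not the ones used in model.py:

import torch
import torch.nn as nn

class DecoderBlockSketch(nn.Module):
    """Illustrative decoder block: masked self-attention, cross-attention, feed-forward."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tgt: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        T = tgt.size(1)
        # Causal mask: position t may only attend to positions <= t.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=tgt.device), diagonal=1)
        x = self.norm1(tgt + self.self_attn(tgt, tgt, tgt, attn_mask=causal)[0])
        # Cross-attention: queries from the decoder, keys/values from encoder outputs.
        x = self.norm2(x + self.cross_attn(x, memory, memory)[0])
        return self.norm3(x + self.ffn(x))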

Installation

# Clone the repository (substitute the actual repository URL)
git clone <repository-url>
cd SmolTransformer

# Install dependencies
chmod +x install.sh
./install.sh

Configuration

The model configuration can be modified in config.py:

from dataclasses import dataclass

@dataclass
class ModelArgs:
    block_size: int = 512           # Maximum sequence length
    batch_size: int = 32            # Training batch size
    embeddings_dims: int = 512      # Model embedding dimensions
    no_of_heads: int = 8            # Number of attention heads
    no_of_decoder_layers: int = 6   # Number of decoder layers
    max_lr: float = 6e-4            # Maximum learning rate
    # ... additional parameters
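
Because ModelArgs is a plain dataclass, selected fields can also be overridden at construction time without editing config.py (assuming the training code reads its hyperparameters from a ModelArgs instance):

from config import ModelArgs

args = ModelArgs(batch_size=16, no_of_decoder_layers=4)   # override selected fields
print(args.embeddings_dims, args.max_lr)                  # remaining fields keep their defaults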

Usage

Training

python trainer.py

Web Application

Launch the interactive Gradio web interface:

python launch_app.py

The app will be available at http://localhost:7860 and provides:

  • Interactive Translation: English to Hindi translation interface
  • Multiple Generation Methods: Top-K sampling and beam search
  • Real-time Parameter Adjustment: Temperature, top-k, beam width controls
  • Model Loading: Load trained checkpoints
  • Example Translations: Pre-built examples to try
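
Under the hood, launch_app.py wires a Gradio interface around the inference functions. The snippet below is a hedged, minimal sketch of such wiring; translate_fn and the control names are illustrative stand-ins, not the project's actual code:

import gradio as gr

def translate_fn(text: str, method: str, temperature: float, top_k: int) -> str:
    # Stand-in: call topk_sampling / beam_search_corrected from inference.py here.
    return f"[{method}] translation of: {text}"

demo = gr.Interface(
    fn=translate_fn,
    inputs=[
        gr.Textbox(label="English text"),
        gr.Radio(["Top-K sampling", "Beam search"], value="Top-K sampling", label="Method"),
        gr.Slider(0.1, 2.0, value=1.0, label="Temperature"),
        gr.Slider(1, 100, value=50, step=1, label="Top-K"),
    ],
    outputs=gr.Textbox(label="Hindi translation"),
    title="Smol Transformer: English to Hindi",
)
demo.launch(server_port=7860)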

Web App Features

  • User-friendly Interface: Clean, intuitive web UI
  • Configurable Parameters: Adjust generation settings in real-time
  • Model Information: Display architecture and training details
  • Multiple Methods: Compare Top-K sampling vs Beam search
  • Checkpoint Loading: Load your trained models
  • Examples: Built-in examples to test the model

Key Training Features

  • Gradient Accumulation: Accumulation steps computed automatically from total_batch_size (see the sketch after this list)
  • Learning Rate Scheduling: Warmup + cosine decay
  • Mixed Precision: Automatic mixed precision training
  • Checkpointing: Regular model checkpoints saved to checkpoints/
  • Generation Samples: Periodic text generation during training
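
A minimal sketch (not the project's trainer.py) of how gradient accumulation, mixed precision, and gradient clipping typically fit together; the stand-in model, data, and total_batch_size value are illustrative:

import torch
import torch.nn as nn

# Illustrative stand-ins; the real model and batches come from model.py and data.py.
model = nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)

def get_batch():
    x = torch.randn(32, 512, device="cuda")
    return x, x   # dummy inputs/targets

batch_size, block_size = 32, 512
total_batch_size = batch_size * block_size * 8                    # illustrative target per optimizer step
grad_accum_steps = total_batch_size // (batch_size * block_size)  # -> 8 micro-batches per step

scaler = torch.cuda.amp.GradScaler()
optimizer.zero_grad(set_to_none=True)
for _ in range(grad_accum_steps):
    x, y = get_batch()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.mse_loss(model(x), y) / grad_accum_steps  # scale so gradients average
    scaler.scale(loss).backward()

scaler.unscale_(optimizer)                               # unscale before clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # 1.0 max-norm clipping
scaler.step(optimizer)
scaler.update()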

Dataset

The model is trained on the Hindi-English Samanantar dataset:

  • Source: English text
  • Target: Hindi text
  • Preprocessing: Automatic tokenization with the IndicBARTSS tokenizer (see the loading sketch below)
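
Assuming the Hugging Face checkpoint name is ai4bharat/IndicBARTSS (an assumption; tokenizer.py defines the identifier actually used), loading and applying the tokenizer looks roughly like:

from transformers import AutoTokenizer

# Assumed checkpoint name; see tokenizer.py for the one used by the project.
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/IndicBARTSS")

batch = tokenizer(
    "Hello, how are you?",
    max_length=512, truncation=True, padding="max_length",
    return_tensors="pt",
)
print(batch["input_ids"].shape)   # (1, 512)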

File Structure

SmolTransformer/
├── config.py          # Model configuration and hyperparameters
├── model.py           # Transformer model implementation
├── data.py            # Dataset loading and preprocessing
├── tokenizer.py       # Tokenizer setup and utilities
├── trainer.py         # Training loop and utilities
├── inference.py       # Text generation functions
├── install.sh         # Installation script
├── README.md          # This file
├── checkpoints/       # Model checkpoints
├── generated_data/    # Generated text samples
├── gradio/            # Gradio interface (optional)
└── old/               # Backup files

Model Parameters

  • Parameters: ~25M (configurable)
  • Context Length: 512 tokens
  • Vocabulary: IndicBARTSS tokenizer (~30K tokens)
  • Architecture: 6-layer encoder-decoder

Training Features

Optimization

  • Optimizer: AdamW with weight decay
  • Learning Rate: 6e-4 with warmup and cosine decay (see the schedule sketch after this list)
  • Gradient Clipping: 1.0 max norm
  • Mixed Precision: Automatic FP16 training
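
A hedged sketch of a warmup-plus-cosine-decay schedule consistent with the numbers above (max_lr = 6e-4); the step counts and min_lr are illustrative:

import math

def get_lr(step: int, max_lr: float = 6e-4, warmup_steps: int = 1000,
           max_steps: int = 20000, min_lr: float = 6e-5) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))   # 1 -> 0 over the decay window
    return min_lr + coeff * (max_lr - min_lr)

# Applied each step before optimizer.step():
# for group in optimizer.param_groups:
#     group["lr"] = get_lr(step)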

Monitoring

  • WandB Integration: Comprehensive experiment tracking (see the logging sketch after this list)
  • Metrics: Loss, perplexity, gradient norms
  • Generation Samples: Regular text generation examples
  • Validation: Periodic validation loss evaluation
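
Logging per-step metrics to WandB amounts to a couple of calls; the project and metric names below are illustrative, and the values are placeholders:

import math
import wandb

wandb.init(project="smol-transformer", config={"max_lr": 6e-4, "block_size": 512})

# Inside the training loop (placeholder values):
loss, grad_norm, lr = 2.31, 0.87, 3e-4
wandb.log({
    "train/loss": loss,
    "train/perplexity": math.exp(loss),
    "train/grad_norm": grad_norm,
    "train/lr": lr,
})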

Generation Methods

  • Top-K Sampling: Configurable top-k and temperature (see the sampling sketch after this list)
  • Beam Search: Multi-beam search with configurable width
  • Repetition Penalty: Reduces repetitive generation
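
A minimal sketch of top-k sampling with temperature and a simple repetition penalty over raw logits; it approximates, but is not, the implementation in inference.py:

from typing import List
import torch

def sample_next_token(logits: torch.Tensor, generated: List[int],
                      top_k: int = 50, temperature: float = 1.0,
                      repetition_penalty: float = 1.2) -> int:
    """logits: (vocab_size,) for the next position; generated: token ids emitted so far."""
    logits = logits.clone()
    for tok in set(generated):
        # Discourage tokens that were already generated.
        logits[tok] = logits[tok] / repetition_penalty if logits[tok] > 0 else logits[tok] * repetition_penalty
    logits = logits / max(temperature, 1e-6)
    topk_vals, topk_idx = torch.topk(logits, k=min(top_k, logits.size(-1)))
    probs = torch.softmax(topk_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return int(topk_idx[choice])

# Usage inside a generation loop (model_logits is the decoder output for the last position):
# next_id = sample_next_token(model_logits[-1], generated_ids)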

Example Usage

from model import Transformer
from tokenizer import initialize_tokenizer
from inference import topk_sampling, beam_search_corrected

# Initialize model and tokenizer
tokenizer = initialize_tokenizer()
model = Transformer(src_vocab_size=len(tokenizer), tgt_vocab_size=len(tokenizer))

# Generate text
prompt = "Hello, how are you?"
generated = topk_sampling(model, prompt, tokenizer, device="cuda", max_length=50)
print(generated)

Customization

Adding New Datasets

Modify data.py to load your dataset:

from datasets import load_dataset

def load_datasets(token, sample_size=None):
    # Load and return your custom dataset here
    dataset = load_dataset("your_dataset")
    return dataset

Changing Model Size

Adjust parameters in config.py:

embeddings_dims = 768  # Larger model
no_of_heads = 12       # More attention heads
no_of_decoder_layers = 12  # Deeper model

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • Transformers
  • Datasets
  • WandB
  • CUDA-capable GPU (recommended)

License

This project is open source and available under the MIT License.

Contributing

Feel free to submit issues, fork the repository, and create pull requests for any improvements.

Source Code

GitHub Repository: Smol Transformer

View the complete implementation, documentation, and examples on GitHub.

Interactive Features

Web Interface: This project includes a Gradio-based web interface for easy interaction and experimentation.

User-Friendly: Simple, intuitive interface perfect for testing and learning.


This project is part of the SmolHub Playground collection - a space for experimental AI models and proof-of-concept implementations.