# TTS

## Overview
A from-scratch implementation of a transformer-based TTS (text-to-speech) model.
## Technical Details
- Framework: PyTorch
- Dataset: Gigaspeech
- Category: Audio/Speech
## Implementation Details
A small transformer-based TTS model, coded and trained from scratch in PyTorch
(a WaveNet implementation will be uploaded soon), following the architecture of
*Neural Speech Synthesis with Transformer Network*.
## Model Hyperparameters

### Core Architecture

| Parameter | Value | Description |
|---|---|---|
| batch_size | 32 | Number of samples per batch |
| max_lr | 6e-4 | Maximum learning rate |
| dropout | 0.1 | General dropout rate |
| epochs | 10 | Total training epochs |
| block_size | 80 | Sequence length in tokens |
| src_vocab_size | dynamic | Source vocabulary size |
| phenome_embeddings_dims | 512 | Phoneme embedding dimension |
| embeddings_dims | 512 | Main embedding dimension |
| prenet_encoder_embeddings_dims | 512 | Encoder prenet dimension |
| embeddings_dims_decoder | 256 | Decoder-specific embedding dimension |
| attn_dropout | 0.1 | Attention dropout rate |
| no_of_heads | 4 | Attention heads per layer |
| no_of_decoder_layers | 8 | Number of decoder layers |
| weight_decay_optim | 0.01 | Optimizer weight decay |
| hidden_dim | 2048 (4×512) | FFN hidden dimension |
| clip | 1.0 | Gradient clipping threshold |
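For reference, the hyperparameters above could be collected into a single config object. This is a minimal sketch, not the repository's actual code — the field names mirror the table (including the repo's own spellings such as `phenome_embeddings_dims`), and the actual implementation may structure its configuration differently:

```python
from dataclasses import dataclass


@dataclass
class TTSConfig:
    # Values taken from the Core Architecture table; names mirror
    # the repository's hyperparameter names (hypothetical structure).
    batch_size: int = 32
    max_lr: float = 6e-4
    dropout: float = 0.1
    epochs: int = 10
    block_size: int = 80
    phenome_embeddings_dims: int = 512
    embeddings_dims: int = 512
    prenet_encoder_embeddings_dims: int = 512
    embeddings_dims_decoder: int = 256
    attn_dropout: float = 0.1
    no_of_heads: int = 4
    no_of_decoder_layers: int = 8
    weight_decay_optim: float = 0.01
    hidden_dim: int = 2048  # FFN width = 4 × embeddings_dims
    clip: float = 1.0


config = TTSConfig()
```

Note the standard transformer ratio: the FFN hidden dimension is four times the embedding dimension (4 × 512 = 2048).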
### Audio Processing
| Parameter | Value | Description |
|---|---|---|
| log_mel_features | 80 | Mel spectrogram channels |
| kernel_size | 5 | Convolution kernel size |
| stride | (2,10) | Convolution stride (time, freq) |
| sr, SAMPLING_RATE | 16000 | Audio sample rate (Hz) |
| N_MELS | 80 | Number of Mel bands |
| WINDOW_DURATION | 0.050s | Analysis window duration |
| STRIDE_DURATION | 0.0125s | Window stride duration |
| max_t | 512 | Maximum spectrogram time steps |
| n_channels | 80 | Input spectrogram channels |
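The time-domain settings above translate directly into STFT frame parameters. A small sketch of that conversion (the frame-count formula assumes centered frames, as in the librosa/torchaudio defaults — the repository's actual front end may differ):

```python
# Audio front-end parameters from the table above.
SAMPLING_RATE = 16000      # Hz
WINDOW_DURATION = 0.050    # seconds
STRIDE_DURATION = 0.0125   # seconds
N_MELS = 80

# Convert durations to sample counts for the STFT.
win_length = int(WINDOW_DURATION * SAMPLING_RATE)  # 800 samples
hop_length = int(STRIDE_DURATION * SAMPLING_RATE)  # 200 samples


def num_frames(n_samples: int) -> int:
    """Number of STFT frames for a clip, assuming centered frames."""
    return 1 + n_samples // hop_length


# A 1-second clip yields 81 frames, each with N_MELS = 80 mel bins.
frames_per_second = num_frames(SAMPLING_RATE)
```

With `torchaudio`, these would be passed as `win_length=800`, `hop_length=200`, and `n_mels=80` to `torchaudio.transforms.MelSpectrogram`; note that `max_t = 512` then caps clips at roughly 6.4 seconds.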
## Dataset
Gigaspeech (other datasets can be substituted)

## Frameworks
- PyTorch
## Epochs/Steps
- Training steps: 150
- Validation: every 50 steps
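The schedule above (150 training steps, a validation pass every 50 steps, gradient clipping at 1.0) can be sketched as a generic PyTorch loop. This is a hypothetical illustration, not the repository's training script — the tiny linear model and random tensors stand in for the actual TTS transformer and Gigaspeech batches, and `max_lr` is used here as a fixed learning rate rather than the peak of a schedule:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

TRAIN_STEPS = 150   # total training steps
VAL_EVERY = 50      # validate every 50 steps
CLIP = 1.0          # gradient clipping threshold

model = nn.Linear(80, 80)  # placeholder for the TTS transformer
optimizer = torch.optim.AdamW(
    model.parameters(), lr=6e-4, weight_decay=0.01  # max_lr, weight_decay_optim
)

for step in range(1, TRAIN_STEPS + 1):
    x = torch.randn(32, 80)          # dummy batch (batch_size=32, 80 mel bins)
    loss = F.mse_loss(model(x), x)   # placeholder reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP)
    optimizer.step()

    if step % VAL_EVERY == 0:        # validation pass every 50 steps
        with torch.no_grad():
            xv = torch.randn(32, 80)
            val_loss = F.mse_loss(model(xv), xv)
```

The same skeleton applies regardless of model: the key pieces from the tables are AdamW with `weight_decay_optim = 0.01` and `clip_grad_norm_` with `clip = 1.0` before each optimizer step.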
## Loss Curves
## Source Code
📁 GitHub Repository: TTS
View the complete implementation, training scripts, and documentation on GitHub.