# TTS

## Overview
A from-scratch implementation of a transformer-based TTS (text-to-speech) model.
## Technical Details
- Framework: PyTorch
- Dataset: Gigaspeech
- Category: Audio/Speech
## Implementation Details
A small transformer-based TTS model, coded and trained from scratch in PyTorch
(a WaveNet implementation will be uploaded soon), following the architecture of
*Neural Speech Synthesis with Transformer Network*.
## Model Hyperparameters

### Core Architecture

| Parameter | Value | Description |
|---|---|---|
| batch_size | 32 | Number of samples per batch |
| max_lr | 6e-4 | Maximum learning rate |
| dropout | 0.1 | General dropout rate |
| epochs | 10 | Total training epochs |
| block_size | 80 | Sequence length in tokens |
| src_vocab_size | dynamic | Source vocabulary size |
| phenome_embeddings_dims | 512 | Phoneme embedding dimension |
| embeddings_dims | 512 | Main embedding dimension |
| prenet_encoder_embeddings_dims | 512 | Encoder prenet dimension |
| embeddings_dims_decoder | 256 | Decoder-specific embedding dimension |
| attn_dropout | 0.1 | Attention dropout rate |
| no_of_heads | 4 | Attention heads per layer |
| no_of_decoder_layers | 8 | Number of decoder layers |
| weight_decay_optim | 0.01 | Optimizer weight decay |
| hidden_dim | 2048 (4×512) | FFN hidden dimension |
| clip | 1.0 | Gradient clipping threshold |
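For reference, the hyperparameters above could be collected into a single config object. This is a minimal sketch, not the repository's actual code — the field names mirror the table (including the repo's own spellings such as `phenome_embeddings_dims`), and the actual implementation may structure its configuration differently:

```python
from dataclasses import dataclass


@dataclass
class TTSConfig:
    # Values taken from the Core Architecture table; names mirror
    # the repository's hyperparameter names (hypothetical structure).
    batch_size: int = 32
    max_lr: float = 6e-4
    dropout: float = 0.1
    epochs: int = 10
    block_size: int = 80
    phenome_embeddings_dims: int = 512
    embeddings_dims: int = 512
    prenet_encoder_embeddings_dims: int = 512
    embeddings_dims_decoder: int = 256
    attn_dropout: float = 0.1
    no_of_heads: int = 4
    no_of_decoder_layers: int = 8
    weight_decay_optim: float = 0.01
    hidden_dim: int = 2048  # FFN width = 4 × embeddings_dims
    clip: float = 1.0


config = TTSConfig()
```

Note the standard transformer ratio: the FFN hidden dimension is four times the embedding dimension (4 × 512 = 2048).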
### Audio Processing
| Parameter | Value | Description |
|---|---|---|
| log_mel_features | 80 | Mel spectrogram channels |
| kernel_size | 5 | Convolution kernel size |
| stride | (2,10) | Convolution stride (time, freq) |
| sr, SAMPLING_RATE | 16000 | Audio sample rate (Hz) |
| N_MELS | 80 | Number of Mel bands |
| WINDOW_DURATION | 0.050s | Analysis window duration |
| STRIDE_DURATION | 0.0125s | Window stride duration |
| max_t | 512 | Maximum spectrogram time steps |
| n_channels | 80 | Input spectrogram channels |
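The time-domain settings above translate directly into STFT frame parameters. A small sketch of that conversion (the frame-count formula assumes centered frames, as in the librosa/torchaudio defaults — the repository's actual front end may differ):

```python
# Audio front-end parameters from the table above.
SAMPLING_RATE = 16000      # Hz
WINDOW_DURATION = 0.050    # seconds
STRIDE_DURATION = 0.0125   # seconds
N_MELS = 80

# Convert durations to sample counts for the STFT.
win_length = int(WINDOW_DURATION * SAMPLING_RATE)  # 800 samples
hop_length = int(STRIDE_DURATION * SAMPLING_RATE)  # 200 samples


def num_frames(n_samples: int) -> int:
    """Number of STFT frames for a clip, assuming centered frames."""
    return 1 + n_samples // hop_length


# A 1-second clip yields 81 frames, each with N_MELS = 80 mel bins.
frames_per_second = num_frames(SAMPLING_RATE)
```

With `torchaudio`, these would be passed as `win_length=800`, `hop_length=200`, and `n_mels=80` to `torchaudio.transforms.MelSpectrogram`; note that `max_t = 512` then caps clips at roughly 6.4 seconds.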
## Dataset
Gigaspeech (other datasets can be substituted)

## Frameworks
- PyTorch
## Epochs/Steps
- Training steps: 150
- Validation: every 50 steps
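The schedule above (150 training steps, a validation pass every 50 steps, gradient clipping at 1.0) can be sketched as a generic PyTorch loop. This is a hypothetical illustration, not the repository's training script — the tiny linear model and random tensors stand in for the actual TTS transformer and Gigaspeech batches, and `max_lr` is used here as a fixed learning rate rather than the peak of a schedule:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

TRAIN_STEPS = 150   # total training steps
VAL_EVERY = 50      # validate every 50 steps
CLIP = 1.0          # gradient clipping threshold

model = nn.Linear(80, 80)  # placeholder for the TTS transformer
optimizer = torch.optim.AdamW(
    model.parameters(), lr=6e-4, weight_decay=0.01  # max_lr, weight_decay_optim
)

for step in range(1, TRAIN_STEPS + 1):
    x = torch.randn(32, 80)          # dummy batch (batch_size=32, 80 mel bins)
    loss = F.mse_loss(model(x), x)   # placeholder reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP)
    optimizer.step()

    if step % VAL_EVERY == 0:        # validation pass every 50 steps
        with torch.no_grad():
            xv = torch.randn(32, 80)
            val_loss = F.mse_loss(model(xv), xv)
```

The same skeleton applies regardless of model: the key pieces from the tables are AdamW with `weight_decay_optim = 0.01` and `clip_grad_norm_` with `clip = 1.0` before each optimizer step.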
## Loss Curves
## Source Code
📁 GitHub Repository: TTS
View the complete implementation, training scripts, and documentation on GitHub.