# Whisper
## Overview
A from-scratch implementation of OpenAI's Whisper speech-recognition model.
## Key Features
- Attention mechanism (multi-head self-attention; see the sketch below)
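
A minimal sketch of the decoder's causal multi-head self-attention, using the embedding width (384) and head count (6) from the hyperparameter table below. The module name and masking details are illustrative, not necessarily what the repository uses:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Causal multi-head self-attention (illustrative sketch)."""

    def __init__(self, embed_dim: int = 384, num_heads: int = 6, dropout: float = 0.1):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)   # fused Q, K, V projection
        self.proj = nn.Linear(embed_dim, embed_dim)      # output projection
        self.dropout = dropout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (B, num_heads, T, head_dim)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        # scaled dot-product attention with a causal mask
        out = F.scaled_dot_product_attention(
            q, k, v, is_causal=True,
            dropout_p=self.dropout if self.training else 0.0,
        )
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)
```
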
## Technical Details
- Framework: PyTorch
- Dataset: GigaSpeech
- Category: Audio/Speech
## Implementation Details
Trained a small Whisper model, coded and trained from scratch in PyTorch, following the approach of the paper *Robust Speech Recognition via Large-Scale Weak Supervision*.
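
Whisper's encoder begins with a convolutional frontend that maps the 80-channel log-mel input into the embedding space before the transformer stack. A hedged sketch using the `kernel_size=3` and `stride=2` values from the table below (the class and layer names, padding, and activation placement are assumptions, not the repo's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoderFrontend(nn.Module):
    """Conv frontend over the 80-channel log-mel input; the second
    convolution halves the time axis (illustrative sketch)."""

    def __init__(self, n_mels: int = 80, embed_dim: int = 384):
        super().__init__()
        # kernel_size=3 and stride=2 come from the hyperparameter table
        self.conv1 = nn.Conv1d(n_mels, embed_dim, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(embed_dim, embed_dim, kernel_size=3, stride=2, padding=1)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, time) -> (batch, time // 2, embed_dim)
        x = F.gelu(self.conv1(mel))
        x = F.gelu(self.conv2(x))
        return x.transpose(1, 2)
```
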
## ModelArgs Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| batch_size | 64 | Number of samples processed before the model is updated. |
| max_lr | 2e-4 | Maximum learning rate. |
| dropout | 0.1 | Dropout rate for regularization. |
| epochs | 10 | Number of training epochs. |
| block_size | 64 | Sequence length (number of tokens or time steps). |
| tgt_vocab_size | 50262 | Size of the target vocabulary. |
| embeddings_dims | 384 | Dimensionality of token embeddings. |
| attn_dropout | 0.1 | Dropout rate for attention layers. |
| no_of_heads | 6 | Number of attention heads in multi-head attention. |
| no_of_decoder_layers | 6 | Number of decoder layers in the model. |
| weight_decay_optim | 0.01 | Weight decay for the optimizer. |
| log_mel_features | 80 | Number of Mel spectrogram features. |
| kernel_size | 3 | Kernel size for convolutional layers. |
| stride | 2 | Stride for convolutional layers. |
| sr | 16000 | Sampling rate of the audio (Hz). |
| device | 'cuda:0' | Device to run the model on (e.g., GPU). |
| SAMPLING_RATE | 16000 | Sampling rate of the audio (Hz). |
| N_MELS | 80 | Number of Mel bins in the spectrogram. |
| WINDOW_DURATION | 0.025 | Duration of the analysis window in seconds (25 ms). |
| STRIDE_DURATION | 0.010 | Stride between consecutive windows in seconds (10 ms). |
| max_t | 500 | Maximum time steps in the spectrogram. |
| n_channels | 80 | Number of channels in the input spectrogram. |
| hidden_dim | 4 * embeddings_dims | Number of neurons in the feed-forward network (FFN). |
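
The same values, collected into a single config object. This sketch assumes a dataclass container; the field names mirror the table, though the repository's actual `ModelArgs` may be structured differently:

```python
from dataclasses import dataclass

@dataclass
class ModelArgs:
    """Hyperparameters from the table above (field names mirror the table)."""
    batch_size: int = 64
    max_lr: float = 2e-4
    dropout: float = 0.1
    epochs: int = 10
    block_size: int = 64
    tgt_vocab_size: int = 50262
    embeddings_dims: int = 384
    attn_dropout: float = 0.1
    no_of_heads: int = 6
    no_of_decoder_layers: int = 6
    weight_decay_optim: float = 0.01
    log_mel_features: int = 80
    kernel_size: int = 3
    stride: int = 2
    sr: int = 16000
    device: str = "cuda:0"

args = ModelArgs()
hidden_dim = 4 * args.embeddings_dims  # FFN width, per the table
```
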
## Dataset
GigaSpeech; used the 'xs' snapshot.
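
A hedged sketch of loading the 'xs' subset via the Hugging Face Hub (the GigaSpeech dataset is gated, so this requires accepting its terms and authenticating with a token) and converting waveforms to 80-bin log-mel features with the 25 ms window / 10 ms stride from the table. The preprocessing details are an assumption, not necessarily the repo's pipeline:

```python
import torch
import torchaudio
from datasets import load_dataset

# Load the 'xs' GigaSpeech subset from the Hugging Face Hub (gated dataset).
ds = load_dataset("speechcolab/gigaspeech", "xs", split="train")

SAMPLING_RATE = 16_000
N_MELS = 80
WINDOW_DURATION = 0.025   # 25 ms analysis window
STRIDE_DURATION = 0.010   # 10 ms hop between windows

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLING_RATE,
    n_fft=int(WINDOW_DURATION * SAMPLING_RATE),       # 400 samples
    hop_length=int(STRIDE_DURATION * SAMPLING_RATE),  # 160 samples
    n_mels=N_MELS,
)

def to_log_mel(waveform: torch.Tensor) -> torch.Tensor:
    """Waveform of shape (1, samples) -> log-mel spectrogram (N_MELS, time)."""
    mel = mel_transform(waveform)
    return torch.log(mel + 1e-6).squeeze(0)  # small epsilon avoids log(0)
```
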
## Frameworks
PyTorch
## Epochs/Steps
- Epochs (train) = 10
- Validation = every epoch
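
A minimal sketch of the training loop these settings imply: AdamW with `weight_decay=0.01`, a one-cycle schedule peaking at `max_lr=2e-4`, 10 epochs, and validation once per epoch. The optimizer/scheduler pairing and the `model`, `train_loader`, and `validate` names are assumptions, not the repository's exact code:

```python
import torch
import torch.nn.functional as F

# Assumed placeholders: `model`, `train_loader`, and `validate` stand in
# for the repository's model, data loader, and validation routine.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=2e-4, epochs=10, steps_per_epoch=len(train_loader),
)

for epoch in range(10):
    model.train()
    for mel, tokens in train_loader:
        # Teacher forcing: predict token t+1 from tokens up to t.
        logits = model(mel, tokens[:, :-1])
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
    validate(model)  # validation once per epoch, as noted above
```
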
## Loss Curves
## Source Code
📁 GitHub Repository: Whisper
View the complete implementation, training scripts, and documentation on GitHub.