CLAP

Category: Audio/Speech
Framework: PyTorch
Dataset: Gigaspeech
Created: August 06, 2025

Overview

A from-scratch implementation of CLAP (Contrastive Language-Audio Pretraining).

Technical Details

  • Framework: PyTorch
  • Dataset: Gigaspeech
  • Category: Audio/Speech

Implementation Details

CLAP: Learning Audio Concepts from Natural Language Supervision
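
The paper trains an audio encoder and a text encoder jointly, projecting both into a shared embedding space where matched audio-text pairs are pulled together and mismatched pairs pushed apart. A minimal sketch of that two-tower layout (module names and encoder internals are illustrative, not this repo's exact code; the dimensions follow the hyperparameter table below):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CLAPTwoTower(nn.Module):
        # Hypothetical wrapper: audio_encoder / text_encoder stand in for
        # whatever backbones the repo uses (e.g., a CNN over log-mel frames
        # and a transformer text encoder).
        def __init__(self, audio_encoder, text_encoder,
                     audio_dim=2048, text_dim=768, shared_dim=1024):
            super().__init__()
            self.audio_encoder = audio_encoder
            self.text_encoder = text_encoder
            # Linear projections into the shared embedding space.
            self.audio_proj = nn.Linear(audio_dim, shared_dim)
            self.text_proj = nn.Linear(text_dim, shared_dim)

        def forward(self, mel, tokens):
            a = self.audio_proj(self.audio_encoder(mel))    # (B, shared_dim)
            t = self.text_proj(self.text_encoder(tokens))   # (B, shared_dim)
            # L2-normalize so dot products are cosine similarities.
            return F.normalize(a, dim=-1), F.normalize(t, dim=-1)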

ModelArgs Hyperparameters

Parameter           Value     Description
epochs              30        Number of training epochs.
text_embeddings     768       Dimensionality of text embeddings.
audio_embeds        2048      Dimensionality of audio embeddings.
block_size          100       Size of input blocks (e.g., sequence length).
batch_size          32        Number of samples per batch.
lr                  4e-4      Learning rate for the main model.
device              'cuda:0'  Device to run the model on (e.g., GPU).
SAMPLING_RATE       44100     Sampling rate of the audio (in Hz).
N_MELS              64        Number of mel-spectrogram bins.
max_t               500       Maximum time steps for sequences.
n_channels          N_MELS    Number of channels in the input (same as N_MELS).
window_size         1024      Window size for the STFT (Short-Time Fourier Transform).
hop_size            320       Hop size for the STFT.
mel_bins            N_MELS    Number of mel bins (same as N_MELS).
fmin                50        Minimum frequency (Hz) for mel-spectrogram computation.
fmax                8000      Maximum frequency (Hz) for mel-spectrogram computation.
output_embeddings   1024      Dimensionality of output embeddings.
head_lr             1e-3      Learning rate for the task-specific head.
audio_encoder_lr    1e-4      Learning rate for the audio encoder.
text_encoder_lr     1e-5      Learning rate for the text encoder.
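
As a sketch, the values above map onto a single config object; the field names below simply mirror the table (the actual ModelArgs class in the repo may group or type them differently):

    from dataclasses import dataclass

    @dataclass
    class ModelArgs:
        # Training
        epochs: int = 30
        batch_size: int = 32
        block_size: int = 100
        lr: float = 4e-4
        head_lr: float = 1e-3
        audio_encoder_lr: float = 1e-4
        text_encoder_lr: float = 1e-5
        device: str = 'cuda:0'
        # Embedding dimensions
        text_embeddings: int = 768
        audio_embeds: int = 2048
        output_embeddings: int = 1024
        # Audio front end
        SAMPLING_RATE: int = 44100
        N_MELS: int = 64
        n_channels: int = 64    # same as N_MELS
        mel_bins: int = 64      # same as N_MELS
        window_size: int = 1024
        hop_size: int = 320
        fmin: int = 50
        fmax: int = 8000
        max_t: int = 500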

Dataset

Gigaspeech

Used the 'xs' snapshot.
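
As a rough sketch, the snapshot can be pulled from the Hugging Face Hub and turned into log-mel features with torchaudio using the STFT settings from the table above. Note the assumptions: Gigaspeech is gated on the Hub, its audio ships at 16 kHz rather than the 44.1 kHz listed above, and the repo's actual preprocessing may differ.

    import torch
    import torchaudio
    from datasets import load_dataset

    # 'xs' is the smallest Gigaspeech snapshot (~10 h of transcribed speech).
    ds = load_dataset("speechcolab/gigaspeech", "xs", split="train")

    to_mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000,   # native Gigaspeech rate; resample if targeting 44100
        n_fft=1024,          # window_size
        hop_length=320,      # hop_size
        n_mels=64,           # N_MELS / mel_bins
        f_min=50,            # fmin
        f_max=8000,          # fmax
    )

    sample = ds[0]
    wav = torch.tensor(sample["audio"]["array"], dtype=torch.float32)
    log_mel = torch.log(to_mel(wav) + 1e-6)   # (n_mels, time)
    caption = sample["text"]                  # paired transcript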

Frameworks:

PyTorch

NOTE

The loss stagnated at 2.079 ≈ −ln(1/8); that is, the logits tend to be too small for the softmax to output anything except uniform probabilities. Please let me know where I am making a mistake.
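
For context: with N candidates per example, a uniform softmax gives a cross-entropy of ln N, and ln 8 ≈ 2.079 is exactly the reported plateau, so the similarity logits never grow large enough to break ties. A common culprit is a missing (or non-learnable) temperature: CLIP-style models multiply the cosine similarities, which lie in [-1, 1], by a learned logit scale before the softmax. Below is a sketch of the symmetric contrastive loss with that fix; the logit_scale parameter is the suggested change, not necessarily what this repo currently does.

    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Learnable temperature, initialized to 1/0.07 as in CLIP.
    logit_scale = nn.Parameter(torch.tensor(math.log(1 / 0.07)))

    def clap_loss(audio_emb, text_emb):
        # audio_emb, text_emb: (B, D) L2-normalized embeddings. Raw cosine
        # similarities lie in [-1, 1]; without the scale the softmax stays
        # nearly uniform, which is the ln(B) plateau described above.
        scale = logit_scale.exp().clamp(max=100.0)
        logits = scale * audio_emb @ text_emb.t()      # (B, B)
        targets = torch.arange(audio_emb.size(0), device=logits.device)
        loss_a = F.cross_entropy(logits, targets)      # audio -> text
        loss_t = F.cross_entropy(logits.t(), targets)  # text -> audio
        return (loss_a + loss_t) / 2

At the CLIP initialization the scale is about 14.3, so the logits span roughly ±14 instead of ±1 and the softmax can actually commit to the matched pair; remember to pass logit_scale to the optimizer along with the encoder parameters.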

Source Code

๐Ÿ“ GitHub Repository: CLAP

View the complete implementation, training scripts, and documentation on GitHub.