VAE

Category: Computer Vision
Framework: PyTorch
Dataset: MNIST, CelebA
Created: June 17, 2025

Overview

From-scratch implementation of a variational autoencoder (VAE), first on MNIST and then extended to CelebA face generation.

Technical Details

  • Framework: PyTorch
  • Dataset: MNIST, CelebA
  • Category: Computer Vision

Implementation Details

I implemented a variational autoencoder (VAE) architecture from scratch in PyTorch and trained it on the CelebA dataset for high-resolution face generation and reconstruction.

Reference: Auto-Encoding Variational Bayes (Kingma & Welling, 2013)

Results

Original vs Reconstructed Images

The following images show the comparison between original CelebA face images (top row) and their reconstructions by the VAE (bottom row):

πŸ–ΌοΈ View Results > Note: If images don’t load, please check the data/ folder in this repository:

  • data/image.png - Reconstruction comparison results
  • data/losses.jpg - Training loss curves
  • data/arithmetic.jpg - Latent space visualizations
  • data/samples.jpg - Generated sample faces from latent space

VAE: Original (top) vs Reconstructed (bottom) - Shows the model's ability to reconstruct high-resolution face images from the latent space representation.

Training Progress

📊 View Training Loss Curves

Training and validation losses over epochs showing convergence of reconstruction and KL divergence losses.

Latent Space Arithmetic

🔗 View Latent Arithmetic

Latent space interpolation and arithmetic operations demonstrating the smooth and meaningful latent representations learned by the VAE.
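For illustration, a linear interpolation between two latent codes could be computed as in the sketch below; model, device, and the preprocessed input tensors x1 and x2 are assumed to exist, and model.encoder is assumed to return the mean and log-variance of the latent distribution.

import torch

# Hypothetical sketch: interpolate between the latent codes of two faces.
with torch.no_grad():
    mu1, _ = model.encoder(x1)              # encode face 1 -> (mean, log-variance)
    mu2, _ = model.encoder(x2)              # encode face 2
    steps = torch.linspace(0, 1, 8, device=device)
    z = torch.stack([(1 - t) * mu1[0] + t * mu2[0] for t in steps])
    interpolated_faces = model.decoder(z)   # 8 frames morphing from face 1 to face 2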

Generated Samples

🎨 View Generated Samples

Random samples generated from the latent space showing the diversity and quality of faces that the VAE can produce.

Model Hyperparameters

Parameter       Value     Description
input_dim       3         Input channels (RGB color images)
hidden_dim      128       Hidden dimension for convolutional layers
output_dim      32        Latent space dimension (bottleneck)
batch_size      32        Number of samples processed before the model is updated
learning_rate   0.0005    Learning rate for the Adam optimizer
epochs          200       Number of training epochs
leaky_relu      0.01      Negative slope for the LeakyReLU activation
image_size      128x128   Input image resolution for CelebA faces
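For reference, the same settings could be collected in a single Python dictionary (a sketch; the notebook may organize them differently):

config = {
    "input_dim": 3,          # RGB input channels
    "hidden_dim": 128,       # base convolutional width
    "output_dim": 32,        # latent (bottleneck) dimension
    "batch_size": 32,
    "learning_rate": 0.0005,
    "epochs": 200,
    "leaky_relu": 0.01,      # negative slope of the activation
    "image_size": 128,       # square input resolution (128x128)
}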

Dataset

CelebA: Large-scale CelebFaces Attributes Dataset

  • 202,599 face images of celebrities
  • High-resolution RGB images (128x128 for this implementation)
  • Rich variety of facial expressions, poses, and lighting conditions
  • Dataset split: 80% training, 20% validation

Data Preprocessing:

  • Resize to 128x128 pixels
  • Convert to RGB tensors
  • Normalize to [0, 1] range
  • No additional data augmentation to preserve face structure
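In torchvision terms this preprocessing amounts to roughly the following sketch (the exact pipeline in the notebook may differ):

from torchvision import transforms

# Resize to 128x128 and convert to a float tensor in [0, 1]; no augmentation is applied.
preprocess = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),  # PIL RGB image -> 3x128x128 tensor scaled to [0, 1]
])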

Frameworks:

PyTorch

Architecture

Encoder:

  • 4 Convolutional layers with LeakyReLU activation
  • Progressive channel increase: 3 → 128 → 256 → 256 → 256
  • Stride 2 for downsampling to reduce spatial dimensions
  • Flatten and linear layers for mean and log variance (reparameterization trick)
  • Output: 32-dimensional latent space
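A minimal encoder sketch consistent with this description is shown below; the kernel sizes, paddings, and the flattened size (256 x 8 x 8 for 128x128 inputs after four stride-2 convolutions) are assumptions rather than values taken from the notebook.

import torch.nn as nn

class Encoder(nn.Module):
    # Sketch: four stride-2 convolutions (3 -> 128 -> 256 -> 256 -> 256), then
    # linear heads producing the mean and log-variance of a 32-D latent.
    def __init__(self, latent_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=4, stride=2, padding=1),    # 128 -> 64
            nn.LeakyReLU(0.01),
            nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1),  # 64 -> 32
            nn.LeakyReLU(0.01),
            nn.Conv2d(256, 256, kernel_size=4, stride=2, padding=1),  # 32 -> 16
            nn.LeakyReLU(0.01),
            nn.Conv2d(256, 256, kernel_size=4, stride=2, padding=1),  # 16 -> 8
            nn.LeakyReLU(0.01),
        )
        self.fc_mu = nn.Linear(256 * 8 * 8, latent_dim)
        self.fc_logvar = nn.Linear(256 * 8 * 8, latent_dim)

    def forward(self, x):
        h = self.conv(x).flatten(start_dim=1)
        return self.fc_mu(h), self.fc_logvar(h)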

Decoder:

  • Linear layer to expand the 32-D latent representation to 262,144 dimensions
  • Reshape to 256 × 32 × 32 feature maps
  • 4 Transposed Convolutional layers with LeakyReLU activation
  • Progressive channel decrease: 256 → 256 → 256 → 128 → 3
  • Stride 2 for upsampling to reconstruct 128×128 images
  • Sigmoid activation for final RGB output [0, 1]
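A matching decoder sketch follows. Note that starting from the stated 256 × 32 × 32 reshape, only two stride-2 steps are needed to reach 128×128, so the strides and kernel sizes below are assumptions chosen to reproduce the stated output size rather than values taken from the notebook.

import torch.nn as nn

class Decoder(nn.Module):
    # Sketch: expand the 32-D latent to 256 x 32 x 32, then transposed
    # convolutions (256 -> 256 -> 256 -> 128 -> 3) up to a 128x128 RGB image.
    def __init__(self, latent_dim=32):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 32 * 32)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 256, kernel_size=3, stride=1, padding=1),  # 32 -> 32
            nn.LeakyReLU(0.01),
            nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),  # 32 -> 64
            nn.LeakyReLU(0.01),
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),  # 64 -> 128
            nn.LeakyReLU(0.01),
            nn.ConvTranspose2d(128, 3, kernel_size=3, stride=1, padding=1),    # 128 -> 128
            nn.Sigmoid(),  # RGB output in [0, 1]
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 256, 32, 32)
        return self.deconv(h)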

Training Details

Optimizer: Adam with learning rate 0.0005
Loss Function: Reconstruction Loss (MSE) + KL Divergence
Training/Validation Split: 80/20
Device: CUDA (with automatic CPU fallback)
Progress Tracking: tqdm progress bars with real-time loss monitoring
Logging: Weights & Biases (wandb) for experiment tracking
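Put together, one training epoch could look roughly like the sketch below; model, device, and a train_loader yielding (image, label) batches are assumed to be defined, and vae_loss is the hypothetical helper sketched under Final Training Metrics.

import torch
from tqdm import tqdm
import wandb

optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)

model.train()
for x, _ in tqdm(train_loader, desc="train"):
    x = x.to(device)
    x_hat, mu, logvar = model(x)                      # reconstruction + latent statistics
    loss, recon, kl = vae_loss(x_hat, x, mu, logvar)  # combined objective (sketched below)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    wandb.log({"loss": loss.item(), "recon": recon.item(), "kl": kl.item()})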

VAE-Specific Components

Reparameterization Trick: Enables backpropagation through stochastic sampling
KL Divergence: Regularizes latent space to follow standard normal distribution
Latent Space: 32-dimensional for rich face feature representation
Loss Weighting: Balanced reconstruction and KL terms for stable training
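The reparameterization step itself is only a few lines; a standard formulation (not copied from the notebook) looks like this:

import torch

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); the randomness is moved into eps,
    # so gradients can flow through mu and logvar.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std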

Final Training Metrics:

  • Reconstruction Loss: MSE between original and reconstructed images
  • KL Loss: Ensures latent variables follow standard normal distribution
  • Total Loss: Weighted combination optimizing both reconstruction quality and latent space structure
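A common way to write this combined objective is sketched below; the exact reduction and KL weighting used in the notebook may differ.

import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, logvar, kl_weight=1.0):
    # Reconstruction term: pixel-wise MSE between input and reconstruction, averaged per sample.
    recon = F.mse_loss(x_hat, x, reduction="sum") / x.size(0)
    # KL divergence between N(mu, sigma^2) and the standard normal prior, averaged per sample.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon + kl_weight * kl, recon, kl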

Files in Repository

Notebooks and Scripts

  • celeba-variational-autoencoders.ipynb - Main implementation with CelebA dataset
  • variational-autoencoders.ipynb - Original MNIST implementation
  • model.ipynb - Model architecture experiments
  • inference_vae.py - Inference script for generating new faces

Data

  • data/image.png - Sample reconstruction results visualization
  • data/losses.jpg - Training loss curves and convergence plots
  • data/arithmetic.jpg - Latent space interpolation and arithmetic examples
  • data/samples.jpg - Generated sample faces from latent space

Image Loading Issues? If images don't display in your markdown viewer:

  1. Navigate to the data/ folder directly to view images
  2. Try the relative path from the repository root: ./data/image.png
  3. Some markdown viewers require the repository to be cloned locally
  4. GitHub should display the images correctly in the web interface

Model Checkpoints

  • vae_checkpoint_epoch_240.pth - Trained model weights after 240 epochs

Usage

Training the Model

  1. Setup Environment:
    pip install torch torchvision tqdm wandb torchinfo matplotlib
    
  2. Prepare CelebA Dataset:
    • Download CelebA dataset
    • Place images in /kaggle/input/celeba-dataset/img_align_celeba/img_align_celeba/
    • Or modify the image_dir path in the notebook
  3. Run Training:
    • Open celeba-variational-autoencoders.ipynb
    • Execute cells sequentially
    • Monitor progress with tqdm progress bars
    • Track metrics on Weights & Biases

Inference

Use the trained model for:

  • Image Reconstruction: Encode and decode existing faces
  • Face Generation: Sample from latent space to generate new faces
  • Latent Interpolation: Smooth transitions between faces
# Load the trained model (a minimal sketch; assumes `model` is an instantiated VAE
# with the architecture described above)
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
checkpoint = torch.load('vae_checkpoint_epoch_240.pth', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device).eval()

# Generate new faces by decoding random latent vectors
with torch.no_grad():
    z = torch.randn(16, 32).to(device)  # 16 samples from the 32-D latent prior
    generated_faces = model.decoder(z)


Source Code

πŸ“ GitHub Repository: VAE

View the complete implementation, training scripts, and documentation on GitHub.