ViT

Overview

From-scratch PyTorch implementation of ViT-B/16 (Vision Transformer). ViT applies a standard transformer encoder directly to sequences of 16×16 image patches, treating each patch as a token, and demonstrates that convolutions are not required for competitive image-classification performance. Based on "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (Dosovitskiy et al., 2021).

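The patch-to-token step can be written as a single strided convolution: a 16×16 kernel with stride 16 slices the image into non-overlapping patches and applies the shared linear projection in one op. A minimal sketch (shapes assume the paper's default 224×224 input; names are illustrative, not this repo's code):

    import torch
    import torch.nn as nn

    # 16x16 conv with stride 16 == cut into 16x16 patches + shared linear projection
    patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

    images = torch.randn(1, 3, 224, 224)         # (batch, channels, height, width)
    patches = patch_embed(images)                # (1, 768, 14, 14): 14 = 224 / 16
    tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 768): 196 patch tokens
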
Architecture

  • Patch size: 16×16 pixels
  • Patch embeddings via linear projection
  • Learnable [CLS] token for classification
  • Standard transformer encoder (multi-head self-attention, MLP, LayerNorm)
  • Classification head on [CLS] token output

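How the pieces above fit together, as a minimal PyTorch sketch built on torch.nn.TransformerEncoder. Class and argument names are illustrative, not taken from this repo; dimensions follow the ViT-B/16 configuration (12 layers, 12 heads, hidden size 768, MLP size 3072):

    import torch
    import torch.nn as nn

    class ViT(nn.Module):
        """Illustrative ViT-B/16 sketch; not this repo's actual module."""

        def __init__(self, num_classes=3, img_size=224, patch=16,
                     dim=768, depth=12, heads=12):
            super().__init__()
            num_patches = (img_size // patch) ** 2              # 196 for 224/16
            self.patch_embed = nn.Conv2d(3, dim, patch, patch)  # patch projection
            self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))  # learnable [CLS]
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
            layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=dim * 4,
                activation="gelu", batch_first=True, norm_first=True)  # pre-LN, as in ViT
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            self.norm = nn.LayerNorm(dim)
            self.head = nn.Linear(dim, num_classes)             # classification head

        def forward(self, x):
            x = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, 196, 768)
            cls = self.cls_token.expand(x.size(0), -1, -1)
            x = torch.cat([cls, x], dim=1) + self.pos_embed     # prepend [CLS], add positions
            x = self.encoder(x)
            return self.head(self.norm(x[:, 0]))                # classify on [CLS] output
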
Training

Hyperparameter   Value
Dataset          Food-101 subset (3 classes: 255 train / 75 test images)
Variant          ViT-B/16

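A minimal training-loop sketch against the ViT class above. The table pins down only the dataset and variant, so the optimizer, learning rate, and stand-in data below are assumptions, not values from this repo:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # Random stand-in batch with the README's shapes (3 classes);
    # swap in the real Food-101 subset loader.
    train_loader = DataLoader(
        TensorDataset(torch.randn(8, 3, 224, 224), torch.randint(0, 3, (8,))),
        batch_size=4)

    model = ViT(num_classes=3)        # ViT class from the architecture sketch above
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # assumed optimizer/lr
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for images, labels in train_loader:
        loss = loss_fn(model(images), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
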
Results

Split   Loss
Train   1.20
Test    1.52

Paper

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., ICLR 2021 (arXiv:2010.11929)