
Assignment 7: Vision Transformers for Action Recognition

📋 Overview

This assignment focuses on implementing and training Vision Transformers (ViT) for action recognition on the KTH-Actions dataset. The project explores transformer-based architectures for video classification, comparing different patch sizes and evaluating performance against RNN models from Assignment 3.

[Image: ViT action recognition output — sample visualization from the Vision Transformer on KTH-Actions]

🎯 Objectives

  • Implement a Vision Transformer (ViT) for action recognition on video sequences
  • Process video frames by breaking them into patches and jointly processing with transformers
  • Compare different patch sizes and their impact on model performance
  • Evaluate ViT performance against RNN models from Assignment 3
  • Explore Video Vision Transformer (ViViT) with Space-Time attention (extra credit)

📊 Dataset

KTH-Actions Dataset - a 6-class action recognition dataset

  • Task: Action recognition from video sequences
  • Classes: walking, jogging, running, boxing, handwaving, handclapping
  • Image size: 64×64×1 (grayscale)
  • Frame processing:
    • Maximum 80 frames per sequence
    • Temporal slicing with step size of 8 (resulting in 10 frames per training sample)
    • Random temporal sampling to handle dataset disparities (empty frames)
  • Split:
    • Training: Person IDs 0-16
    • Testing: Person IDs 17-25

The dataset is located at /home/nfs/inf6/data/datasets/kth_actions/processed/

🏗️ Models Implemented

Vision Transformer (ViT)

A transformer-based architecture for video action recognition:

Architecture Components:

  • Patchifier: Breaks each frame into patches (configurable patch size)
  • Patch Projection: Projects patches to the token dimension with LayerNorm
  • CLS Token: Learnable classification token for each frame
  • Positional Encoding: Adds positional information to tokens
  • Transformer Blocks: Stack of transformer encoder blocks, each with:
    • Multi-Head Self-Attention
    • MLP (feed-forward network)
    • Residual connections and LayerNorm
  • Classifier: Linear layer for final classification
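The patchifier step can be sketched with `torch.Tensor.unfold`. This is a minimal illustration of splitting a frame into non-overlapping patches, not the assignment's exact implementation; the function name and shapes are assumptions.

```python
import torch

def patchify(frames, patch_size=16):
    """Split each frame into non-overlapping square patches.

    frames: (B, C, H, W) tensor; H and W must be divisible by patch_size.
    Returns: (B, num_patches, C * patch_size * patch_size) flattened patches.
    """
    B, C, H, W = frames.shape
    # Unfold height then width: (B, C, H/p, W/p, p, p)
    patches = frames.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # Reorder and flatten each patch into a vector: (B, H/p * W/p, C * p * p)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch_size * patch_size)
    return patches

# 64×64 grayscale frames with 16×16 patches -> 16 patches of dimension 256
x = torch.randn(2, 1, 64, 64)
print(patchify(x).shape)  # torch.Size([2, 16, 256])
```

A learnable `nn.Linear` then projects each flattened patch to the token dimension (192 by default).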

Key Features:

  • Processes video sequences frame by frame
  • Each frame is divided into patches and processed independently
  • CLS tokens from all frames are averaged for the final classification
  • Supports configurable patch sizes, token dimensions, and numbers of layers
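The CLS-averaging step amounts to a mean over the frame axis before the classifier. A minimal sketch (the function name and shapes are assumptions, not the assignment's code):

```python
import torch

def classify_clip(cls_tokens, classifier):
    """Average per-frame CLS tokens, then classify the clip.

    cls_tokens: (B, num_frames, token_dim), one CLS token per frame.
    Returns: (B, num_classes) logits.
    """
    clip_feature = cls_tokens.mean(dim=1)  # average over frames
    return classifier(clip_feature)

classifier = torch.nn.Linear(192, 6)  # token_dim=192, 6 KTH classes
logits = classify_clip(torch.randn(4, 10, 192), classifier)
print(logits.shape)  # torch.Size([4, 6])
```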

Default Configuration:

  • Patch size: 16×16 (configurable: 8, 16, 32, 64)
  • Token dimension: 192
  • Attention dimension: 192
  • Number of heads: 4
  • MLP size: 768
  • Number of transformer layers: 6
  • Number of classes: 6

Multi-Head Self-Attention

Implements scaled dot-product attention with multiple heads:

  • Efficient head splitting and merging for batch processing
  • Supports 4D input tensors (batch, sequence, tokens, dimensions)
  • Attention maps can be extracted for visualization
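The head splitting and merging can be sketched as follows. This is an illustrative version operating on 3D `(batch, tokens, dim)` input (the project's module additionally handles a 4D frame-sequence layout); class and attribute names are assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product attention with head splitting/merging (a sketch)."""

    def __init__(self, token_dim, num_heads=4):
        super().__init__()
        assert token_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = token_dim // num_heads
        self.qkv = nn.Linear(token_dim, 3 * token_dim)  # fused Q, K, V projection
        self.out = nn.Linear(token_dim, token_dim)

    def forward(self, x):
        B, N, D = x.shape
        # Split into Q, K, V and into heads: each (B, heads, N, head_dim)
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        # Scaled dot-product attention
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)  # attention maps, extractable for visualization
        # Merge heads back: (B, N, D)
        y = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.out(y)
```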

Transformer Block

Standard transformer encoder block:

  • Multi-Head Self-Attention
  • Residual connections
  • Layer Normalization
  • MLP with GELU activation
  • Dropout for regularization
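The components above combine into a pre-norm encoder block. A self-contained sketch using `nn.MultiheadAttention` for brevity (the assignment uses its own attention module; dimensions are the defaults listed earlier):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm encoder block: norm -> attention -> residual, then norm -> MLP -> residual."""

    def __init__(self, token_dim=192, num_heads=4, mlp_size=768, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(token_dim)
        self.attn = nn.MultiheadAttention(token_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(token_dim)
        self.mlp = nn.Sequential(
            nn.Linear(token_dim, mlp_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_size, token_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):  # x: (batch, tokens, token_dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.mlp(self.norm2(x))                    # residual around MLP
        return x
```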

🔬 Experiments

The project includes experiments comparing different configurations:

| Configuration                | Patch Size | Epochs | Token Dim | MLP Size | Layers |
|------------------------------|------------|--------|-----------|----------|--------|
| ViT_patch_size_8             | 8×8        | 60     | 128       | 512      | 4      |
| ViT_patch_size_16_epochs_60  | 16×16      | 60     | 192       | 768      | 6      |
| ViT_patch_size_16_epochs_100 | 16×16      | 100    | 192       | 768      | 6      |
| ViT_patch_size_32            | 32×32      | 60     | -         | -        | -      |
| ViT_patch_size_64            | 64×64      | 60     | -         | -        | -      |

Training Configuration

  • Optimizer: Adam
  • Learning rate: 3e-4
  • Batch size: 32
  • Epochs: 60-100 (configurable)
  • Loss function: CrossEntropyLoss
  • Scheduler: StepLR (step_size=10, gamma=1/3, optional)
  • Validation: Evaluated on test set
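The training configuration above maps directly onto PyTorch's optimizer and scheduler APIs. A sketch with a stand-in model (the real training loop lives in trainer.py / utils.py):

```python
import torch
import torch.nn as nn

model = nn.Linear(192, 6)          # stand-in for the ViT
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# Optional: decay LR by a factor of 1/3 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=1/3)

for epoch in range(60):
    # ... per-batch forward/backward passes and optimizer updates go here ...
    optimizer.step()   # placeholder for the actual batch updates
    scheduler.step()   # advance the LR schedule once per epoch
```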

🛠️ Key Features

Data Preprocessing

Temporal Processing:

  • Random temporal sampling: selects an 80-frame window with a random start index from each sequence
  • Temporal slicing: samples every 8th frame (resulting in 10 frames per sample)
  • Handles dataset disparities by avoiding empty frames

Spatial Augmentations (Training):

  • Random horizontal flip (p=0.5)
  • Random rotation (±25 degrees)

Temporal Augmentations (Training):

  • Random temporal sampling (slicing step=8)
  • Random temporal reversal (p=0.3)
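The temporal pipeline (random 80-frame window, step-8 slicing, random reversal) can be sketched as below; the function name and exact interface are assumptions, and handling of sequences shorter than the window (padding) is done elsewhere in the dataset.

```python
import random

def sample_frames(sequence, max_frames=80, step=8, p_reverse=0.3):
    """Random temporal sampling: pick a random 80-frame window, take every
    8th frame (-> 10 frames), and reverse the clip with probability 0.3."""
    start = random.randint(0, max(0, len(sequence) - max_frames))
    clip = sequence[start:start + max_frames:step]
    if random.random() < p_reverse:   # training-only temporal reversal
        clip = clip[::-1]
    return clip

frames = list(range(200))             # dummy frame indices
print(len(sample_frames(frames)))     # 10
```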

Test Transforms:

  • Temporal sampling only (no augmentation)

Training Infrastructure

  • TensorBoard logging: Training/validation loss, accuracy, and learning rate curves
  • Model checkpointing: Saves best models with training configurations
  • Progress tracking: Real-time training progress with tqdm
  • Evaluation metrics: Accuracy, loss tracking
  • Configuration management: YAML-based config files for experiment tracking
  • Reproducibility: Fixed random seeds for consistent results

DataLoader

KTHActionDataset:

  • Handles video sequence loading
  • Supports train/test splits based on person IDs
  • Random frame selection within sequences
  • Padding for sequences shorter than max_frames
  • Grayscale image conversion and resizing
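A skeleton of such a video dataset, including the padding behavior for short sequences. This is a hedged sketch; the real KTHActionDataset in dataloader.py (file loading, person-ID splits, resizing) differs in detail.

```python
import torch
from torch.utils.data import Dataset

class VideoClipDataset(Dataset):
    """Minimal video-clip dataset: truncates long sequences and pads short ones."""

    def __init__(self, sequences, labels, max_frames=80):
        self.sequences = sequences   # list of (T, 1, 64, 64) tensors
        self.labels = labels
        self.max_frames = max_frames

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        seq = self.sequences[idx][: self.max_frames]
        if seq.shape[0] < self.max_frames:
            # Pad by repeating the last frame up to max_frames
            pad = seq[-1:].repeat(self.max_frames - seq.shape[0], 1, 1, 1)
            seq = torch.cat([seq, pad], dim=0)
        return seq, self.labels[idx]

ds = VideoClipDataset([torch.zeros(50, 1, 64, 64)], [0])
clip, label = ds[0]
print(clip.shape)  # torch.Size([80, 1, 64, 64])
```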

📁 Project Structure

Assignment7/
├── Assignment7.ipynb          # Main assignment notebook
├── Session7.ipynb             # Lab session materials
├── models.py                  # ViT and transformer components
├── trainer.py                 # Training script
├── utils.py                   # Utility functions (training, evaluation, visualization)
├── dataloader.py              # KTHActionDataset implementation
├── transformations.py         # Data augmentation transforms
├── configs/                   # Experiment configuration files
│   ├── ViT_patch_size_16_epochs_100.yaml
│   └── ViT_patch_size_16_epochs_60.yaml
├── tboard_logs/               # TensorBoard logs
│   ├── ViT_patch_size_8_epochs_60/
│   ├── ViT_patch_size_16_epochs_60/
│   ├── ViT_patch_size_16_epochs_100/
│   ├── ViT_patch_size_32_epochs_60/
│   └── ViT_patch_size_64_epochs_60/
└── resources/                 # Reference images and documentation
    ├── vit_img.png
    ├── seminar.png
    └── ...

📈 Analysis & Results

Model Comparison

The notebook includes comprehensive analysis:

  • Learning curves: training vs. validation loss over epochs
  • Accuracy metrics: overall classification accuracy
  • Patch size comparison: performance across different patch sizes
  • Comparison with RNN models: evaluation against Assignment 3 results

Key Findings

  1. Patch Size Impact: Smaller patch sizes (8×8) provide more tokens per frame but increase computational cost
  2. Temporal Processing: Averaging CLS tokens across frames effectively captures temporal information
  3. Data Handling: Random temporal sampling helps avoid empty frames and improves training stability
  4. Transformer Architecture: ViT shows competitive performance for action recognition tasks
  5. Attention Mechanisms: Multi-head attention allows the model to focus on different spatial regions
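The patch-size trade-off in finding 1 is simple arithmetic: a 64×64 frame split into p×p patches yields (64/p)² tokens, and self-attention cost grows with the square of the token count.

```python
# Tokens per 64×64 frame for each patch size, and the number of pairwise
# attention entries (tokens²) that drives self-attention cost.
for p in (8, 16, 32, 64):
    tokens = (64 // p) ** 2
    print(f"patch {p}x{p}: {tokens} tokens, {tokens ** 2} attention entries")
# patch 8x8:   64 tokens, 4096 attention entries
# patch 16x16: 16 tokens,  256 attention entries
# patch 32x32:  4 tokens,   16 attention entries
# patch 64x64:  1 token,     1 attention entry
```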

🚀 Usage

Setup

  1. Install dependencies:

    pip install torch torchvision numpy matplotlib seaborn tqdm pyyaml tensorboard pytorch-lightning
    

  2. Ensure the dataset is available:

     • The dataset should be located at /home/nfs/inf6/data/datasets/kth_actions/processed/
     • Alternatively, modify root_dir in the training script

  3. Open the notebook:

    jupyter notebook Assignment7.ipynb
    

Running Experiments

  1. Using the Notebook: execute cells sequentially to:

     • Load and inspect the dataset
     • Define and initialize the ViT model
     • Train with different configurations
     • Evaluate models and visualize results

  2. Using the Training Script:

    python trainer.py
    

     • Modify the configs dictionary in trainer.py to change hyperparameters
     • Configurations are automatically saved to YAML files

  3. Custom Configuration:

    configs = {
        "model_name": "ViT",
        "batch_size": 32,
        "num_epochs": 100,
        "lr": 3e-4,
        "patch_size": 16,
        "token_dim": 192,
        "attn_dim": 192,
        "num_heads": 4,
        "mlp_size": 768,
        "num_tf_layers": 6,
        "num_classes": 6,
        "max_frames": 80,
        "slicing_step": 8
    }
    

Viewing TensorBoard Logs

tensorboard --logdir=tboard_logs

Then open http://localhost:6006 in your browser to view training curves.

Loading Saved Models

import torch

# Construct model and optimizer with the same configuration before loading
checkpoint = torch.load('checkpoints/checkpoint_ViT_patch_size_16_epochs_100.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
stats = checkpoint['stats']

🔧 Utility Functions

The utils.py file provides:

  • train_model(): Complete training loop with TensorBoard logging
  • train_epoch(): Training for one epoch
  • eval_model(): Model evaluation on validation/test set
  • save_model() / load_model(): Model checkpointing
  • count_model_params(): Count learnable parameters
  • smooth(): Loss curve smoothing for visualization
  • set_random_seed(): Reproducibility utilities

🎓 Extra Credit: Video Vision Transformer (ViViT)

The assignment mentions implementing ViViT with Space-Time attention as an extra credit task. ViViT extends ViT to explicitly model temporal relationships in video sequences using space-time attention mechanisms.

Key Differences from Standard ViT:

  • Space-Time Attention: jointly attends to spatial and temporal dimensions
  • Temporal Modeling: explicitly models relationships between frames
  • 3D Patches: can process spatiotemporal patches instead of frame-by-frame
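The 3D-patch idea ("tubelet embedding" in the ViViT paper) can be sketched with a 3D convolution whose kernel and stride equal the tubelet size, so each spatiotemporal patch becomes one token. Class name and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Map each (t, p, p) spatiotemporal patch of a video to one token."""

    def __init__(self, token_dim=192, t=2, p=16, in_channels=1):
        super().__init__()
        # Non-overlapping 3D patches: kernel == stride
        self.proj = nn.Conv3d(in_channels, token_dim,
                              kernel_size=(t, p, p), stride=(t, p, p))

    def forward(self, video):               # (B, C, T, H, W)
        x = self.proj(video)                # (B, D, T/t, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, num_tubelets, D)

# 10-frame 64×64 grayscale clip -> (10/2) * (64/16)^2 = 80 tubelet tokens
v = torch.randn(2, 1, 10, 64, 64)
print(TubeletEmbedding()(v).shape)  # torch.Size([2, 80, 192])
```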

Reference: ViViT: A Video Vision Transformer (Arnab et al., ICCV 2021)

💬 Support

If you found this project helpful, you can support my work by buying me a coffee or via PayPal!

Buy Me a Coffee

PayPal


Location

The complete assignment documentation, code, and notebooks are located in:

src/Assignment7/


This assignment demonstrates transformer-based architectures for video action recognition, exploring how Vision Transformers can be adapted for temporal sequence modeling.