Assignment 7: Vision Transformers for Action Recognition
📋 Overview
This assignment focuses on implementing and training Vision Transformers (ViT) for action recognition on the KTH-Actions dataset. The project explores transformer-based architectures for video classification, comparing different patch sizes and evaluating performance against RNN models from Assignment 3.
Sample output from the Vision Transformer on KTH-Actions (visualization)
🎯 Objectives
- Implement a Vision Transformer (ViT) for action recognition on video sequences
- Process video frames by breaking them into patches and jointly processing with transformers
- Compare different patch sizes and their impact on model performance
- Evaluate ViT performance against RNN models from Assignment 3
- Explore Video Vision Transformer (ViViT) with Space-Time attention (extra credit)
📊 Dataset
KTH-Actions Dataset - 6-class action recognition dataset

- Task: Action recognition from video sequences
- Classes: walking, jogging, running, boxing, handwaving, handclapping
- Image size: 64×64×1 (grayscale)
- Frame processing:
  - Maximum 80 frames per sequence
  - Temporal slicing with step size of 8 (resulting in 10 frames per training sample)
  - Random temporal sampling to handle dataset disparities (empty frames)
- Split:
  - Training: Person IDs 0-16
  - Testing: Person IDs 17-25
The dataset is located at /home/nfs/inf6/data/datasets/kth_actions/processed/
🏗️ Models Implemented
Vision Transformer (ViT)
A transformer-based architecture for video action recognition:
Architecture Components:
- Patchifier: Breaks each frame into patches (configurable patch size)
- Patch Projection: Projects patches to the token dimension with LayerNorm
- CLS Token: Learnable classification token for each frame
- Positional Encoding: Adds positional information to tokens
- Transformer Blocks: Stack of transformer encoder blocks with:
  - Multi-Head Self-Attention
  - MLP (Feed-Forward Network)
  - Residual connections and LayerNorm
- Classifier: Linear layer for final classification
Key Features:
- Processes video sequences frame-by-frame
- Each frame is divided into patches and processed independently
- CLS tokens from all frames are averaged for final classification
- Supports configurable patch sizes, token dimensions, and numbers of layers
Default Configuration:
- Patch size: 16×16 (configurable: 8, 16, 32, 64)
- Token dimension: 192
- Attention dimension: 192
- Number of heads: 4
- MLP size: 768
- Number of transformer layers: 6
- Number of classes: 6
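The patchify-and-project step can be sketched as below. This is a minimal illustration, not the assignment's actual implementation; the class name `Patchifier` and its exact interface are assumptions, chosen to match the component names above (64×64 grayscale frames, 16×16 patches, token dimension 192).

```python
import torch
import torch.nn as nn

class Patchifier(nn.Module):
    """Split a frame into non-overlapping patches and project to the token dimension."""
    def __init__(self, patch_size=16, in_channels=1, token_dim=192):
        super().__init__()
        self.patch_size = patch_size
        patch_dim = in_channels * patch_size * patch_size
        self.project = nn.Sequential(
            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, token_dim),
            nn.LayerNorm(token_dim),
        )

    def forward(self, x):
        # x: (batch, channels, H, W)
        B, C, H, W = x.shape
        p = self.patch_size
        # unfold height and width into non-overlapping p×p windows
        patches = x.unfold(2, p, p).unfold(3, p, p)              # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        return self.project(patches)                             # (B, num_patches, token_dim)

frames = torch.randn(2, 1, 64, 64)   # two 64×64 grayscale frames
tokens = Patchifier()(frames)
print(tokens.shape)                  # torch.Size([2, 16, 192])
```

With 64×64 frames and 16×16 patches each frame yields (64/16)² = 16 tokens; an 8×8 patch size would yield 64 tokens per frame instead.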
Multi-Head Self-Attention
Implements scaled dot-product attention with multiple heads:
- Efficient head splitting and merging for batch processing
- Supports 4D input tensors (batch, sequence, tokens, dimensions)
- Attention maps can be extracted for visualization
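The head splitting and merging described above can be sketched as follows. This is an illustrative re-implementation under the default configuration (dim 192, 4 heads), not the assignment's own code; the `attn` tensor computed inside `forward` is the per-head attention map one would extract for visualization.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product attention with head splitting and merging (sketch)."""
    def __init__(self, dim=192, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # joint Q, K, V projection
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, tokens, dim)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)                      # attention maps per head
        merged = (attn @ v).transpose(1, 2).reshape(B, N, D)  # merge heads back
        return self.out(merged)

x = torch.randn(2, 17, 192)   # 16 patch tokens + 1 CLS token per frame
y = MultiHeadSelfAttention()(x)
print(y.shape)                # torch.Size([2, 17, 192])
```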
Transformer Block
Standard transformer encoder block:
- Multi-Head Self-Attention
- Residual connections
- Layer Normalization
- MLP with GELU activation
- Dropout for regularization
🔬 Experiments
The project includes experiments comparing different configurations:
| Configuration | Patch Size | Epochs | Token Dim | MLP Size | Layers |
|---|---|---|---|---|---|
| ViT_patch_size_8 | 8×8 | 60 | 128 | 512 | 4 |
| ViT_patch_size_16_epochs_60 | 16×16 | 60 | 192 | 768 | 6 |
| ViT_patch_size_16_epochs_100 | 16×16 | 100 | 192 | 768 | 6 |
| ViT_patch_size_32 | 32×32 | 60 | - | - | - |
| ViT_patch_size_64 | 64×64 | 60 | - | - | - |
Training Configuration
- Optimizer: Adam
- Learning rate: 3e-4
- Batch size: 32
- Epochs: 60-100 (configurable)
- Loss function: CrossEntropyLoss
- Scheduler: StepLR (step_size=10, gamma=1/3, optional)
- Validation: Evaluated on test set
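The training setup above translates directly into PyTorch; here is a sketch with a stand-in model (the real ViT is defined in `models.py`):

```python
import torch
import torch.nn as nn

model = nn.Linear(192, 6)            # stand-in for the ViT; 6 action classes
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# optional scheduler: divide the learning rate by 3 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=1/3)
```

Calling `scheduler.step()` once per epoch drops the learning rate from 3e-4 to 1e-4 after epoch 10, then to about 3.3e-5 after epoch 20, and so on.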
🛠️ Key Features
Data Preprocessing
Temporal Processing:
- Random temporal sampling: selects an 80-frame window with a random start index from each sequence
- Temporal slicing: samples every 8th frame (resulting in 10 frames per sample)
- Handles dataset disparities by avoiding empty frames
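The two temporal steps above can be sketched as a single function; the name `sample_clip` is illustrative, not the assignment's actual helper (sequences shorter than 80 frames are padded elsewhere, per the DataLoader section):

```python
import torch

def sample_clip(video, max_frames=80, step=8):
    """Pick a random max_frames window, then keep every `step`-th frame."""
    T = video.shape[0]                                     # video: (T, C, H, W)
    start = torch.randint(0, max(T - max_frames, 0) + 1, (1,)).item()
    window = video[start:start + max_frames]               # random 80-frame window
    return window[::step]                                  # 80 / 8 = 10 frames

video = torch.randn(120, 1, 64, 64)   # a 120-frame sequence
clip = sample_clip(video)
print(clip.shape)                     # torch.Size([10, 1, 64, 64])
```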
Spatial Augmentations (Training):
- Random horizontal flip (p=0.5)
- Random rotation (±25 degrees)

Temporal Augmentations (Training):
- Random temporal sampling (slicing step=8)
- Random temporal reversal (p=0.3)

Test Transforms:
- Temporal sampling only (no augmentation)
Training Infrastructure
- TensorBoard logging: Training/validation loss, accuracy, and learning rate curves
- Model checkpointing: Saves best models with training configurations
- Progress tracking: Real-time training progress with tqdm
- Evaluation metrics: Accuracy, loss tracking
- Configuration management: YAML-based config files for experiment tracking
- Reproducibility: Fixed random seeds for consistent results
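A seed utility of the kind listed above (the actual `set_random_seed()` lives in `utils.py`; this sketch shows the usual ingredients):

```python
import random
import numpy as np
import torch

def set_random_seed(seed=42):
    """Fix all RNG seeds so that runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)   # no-op without a GPU

set_random_seed(42)
a = torch.rand(3)
set_random_seed(42)
b = torch.rand(3)
print(torch.equal(a, b))   # True
```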
DataLoader
KTHActionDataset:
- Handles video sequence loading
- Supports train/test splits based on person IDs
- Random frame selection within sequences
- Padding for sequences shorter than max_frames
- Grayscale image conversion and resizing
📁 Project Structure
Assignment7/
├── Assignment7.ipynb # Main assignment notebook
├── Session7.ipynb # Lab session materials
├── models.py # ViT and transformer components
├── trainer.py # Training script
├── utils.py # Utility functions (training, evaluation, visualization)
├── dataloader.py # KTHActionDataset implementation
├── transformations.py # Data augmentation transforms
├── configs/ # Experiment configuration files
│ ├── ViT_patch_size_16_epochs_100.yaml
│ └── ViT_patch_size_16_epochs_60.yaml
├── tboard_logs/ # TensorBoard logs
│ ├── ViT_patch_size_8_epochs_60/
│ ├── ViT_patch_size_16_epochs_60/
│ ├── ViT_patch_size_16_epochs_100/
│ ├── ViT_patch_size_32_epochs_60/
│ └── ViT_patch_size_64_epochs_60/
└── resources/ # Reference images and documentation
├── vit_img.png
├── seminar.png
└── ...
📈 Analysis & Results
Model Comparison
The notebook includes comprehensive analysis:
- Learning curves: training vs. validation loss over epochs
- Accuracy metrics: overall classification accuracy
- Patch size comparison: performance across different patch sizes
- Comparison with RNN models: evaluation against Assignment 3 results
Key Findings
- Patch Size Impact: Smaller patch sizes (8×8) provide more tokens per frame but increase computational cost
- Temporal Processing: Averaging CLS tokens across frames effectively captures temporal information
- Data Handling: Random temporal sampling helps avoid empty frames and improves training stability
- Transformer Architecture: ViT shows competitive performance for action recognition tasks
- Attention Mechanisms: Multi-head attention allows the model to focus on different spatial regions
🚀 Usage
Setup
1. Install dependencies:

2. Ensure the dataset is available:
   - The dataset should be located at /home/nfs/inf6/data/datasets/kth_actions/processed/
   - Or modify `root_dir` in the training script

3. Open the notebook:

Running Experiments

1. Using the Notebook:
   Execute cells sequentially to:
   - Load and inspect the dataset
   - Define and initialize the ViT model
   - Train with different configurations
   - Evaluate models and visualize results

2. Using the Training Script:
   - Modify the `configs` dictionary in `trainer.py` to change hyperparameters
   - Configurations are automatically saved to YAML files

3. Custom Configuration:
Viewing TensorBoard Logs
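Assuming the log directory from the project structure above, TensorBoard can be launched with:

```shell
tensorboard --logdir tboard_logs/
```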
Then open http://localhost:6006 in your browser to view training curves.
Loading Saved Models
```python
import torch
from utils import load_model

# Instantiate `model` and `optimizer` with the same configuration used for training
# before restoring their states from the checkpoint.
checkpoint = torch.load('checkpoints/checkpoint_ViT_patch_size_16_epochs_100.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
stats = checkpoint['stats']
```
🔧 Utility Functions
The utils.py file provides:
- `train_model()`: Complete training loop with TensorBoard logging
- `train_epoch()`: Training for one epoch
- `eval_model()`: Model evaluation on the validation/test set
- `save_model()` / `load_model()`: Model checkpointing
- `count_model_params()`: Count learnable parameters
- `smooth()`: Loss curve smoothing for visualization
- `set_random_seed()`: Reproducibility utilities
🎓 Extra Credit: Video Vision Transformer (ViViT)
The assignment mentions implementing ViViT with Space-Time attention as an extra credit task. ViViT extends ViT to explicitly model temporal relationships in video sequences using space-time attention mechanisms.
Key Differences from Standard ViT:
- Space-Time Attention: Jointly attends to spatial and temporal dimensions
- Temporal Modeling: Explicitly models relationships between frames
- 3D Patches: Can process spatiotemporal patches instead of frame-by-frame
Reference: ViViT: A Video Vision Transformer
🔗 References
- Vision Transformer (ViT) Paper
- ViViT: A Video Vision Transformer
- KTH-Actions Dataset
- PyTorch Documentation
- TensorBoard
- Attention Is All You Need
💬 Support
If you found this project helpful, you can support my work by buying me a coffee or via PayPal!
Location
The complete assignment documentation, code, and notebooks are located in:
This assignment demonstrates transformer-based architectures for video action recognition, exploring how Vision Transformers can be adapted for temporal sequence modeling.