# Assignment 3: Recurrent Neural Networks for Action Recognition

## Overview
This assignment focuses on implementing Recurrent Neural Networks (RNNs) from scratch and applying them to action recognition tasks. The project implements custom LSTM and Convolutional LSTM cells, then compares them with PyTorch's built-in RNN modules on the KTH-Actions dataset for video action classification.
*Sample output from the RNN action recognition model*

## Objectives
- Implement LSTM and ConvLSTM cells from scratch
- Build an action recognition pipeline using RNNs
- Compare different RNN architectures (LSTMCell, GRUCell, custom implementations)
- Evaluate models on accuracy, training/inference time, and parameter count
- Implement 3D-CNN (R(2+1)d-Net) for action classification (extra credit)
## Dataset

**KTH-Actions Dataset** - a human action recognition dataset:

- Actions: walking, jogging, running, boxing, handwaving, handclapping
- Frame size: 64×64 pixels (grayscale)
- Sequence length: 10 frames per sample
- Split: person IDs 0-16 for training, 17-25 for testing
- Source: KTH-Actions Dataset
The dataset is automatically loaded using the custom KTHActionDataset class in src/dataloader.py.
## Models Implemented

### 1. Custom LSTM (OwnLSTM)
A fully custom LSTM implementation from scratch with the following components:
**Architecture:**

- Forget gate: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
- Input gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
- Candidate gate: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
- Output gate: o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
- Cell state: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
- Hidden state: h_t = o_t ⊙ tanh(C_t)

**Features:**

- Xavier weight initialization
- Supports both single-step and sequence inputs
- Custom forward pass implementation
- Final linear layer for classification output
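The gate equations above can be sketched as a single NumPy step. This is a minimal illustration, not the assignment's `OwnLSTM` code: it packs the four per-gate weight matrices into one stacked `W` for brevity, and the function name is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations above.

    W maps the concatenated [h_{t-1}, x_t] to the four stacked gates
    (forget, input, candidate, output); b is the stacked bias.
    """
    hidden = h_prev.shape[-1]
    z = np.concatenate([h_prev, x_t], axis=-1) @ W.T + b  # (batch, 4*hidden)
    f = sigmoid(z[:, :hidden])                   # forget gate
    i = sigmoid(z[:, hidden:2 * hidden])         # input gate
    g = np.tanh(z[:, 2 * hidden:3 * hidden])     # candidate C̃_t
    o = sigmoid(z[:, 3 * hidden:])               # output gate
    c_t = f * c_prev + i * g                     # new cell state
    h_t = o * np.tanh(c_t)                       # new hidden state
    return h_t, c_t
```

Since h_t = o_t ⊙ tanh(C_t) with both factors bounded, every hidden-state entry stays inside (-1, 1).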
### 2. Convolutional LSTM Cell (ConvLSTMCell)
A convolutional variant of LSTM that preserves spatial information:
**Architecture:**

- Uses 1D convolutions instead of linear layers
- Maintains spatial dimensions through the sequence
- Separate convolutional layers for each gate (forget, input, candidate, output)
- Kernel size: 3 (default), with padding to preserve dimensions

**Features:**

- Processes spatio-temporal data efficiently
- Suitable for video sequences with spatial structure
- Custom implementation matching the standard ConvLSTM formulation
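The idea of swapping linear gate maps for convolutions can be sketched as below. This is an illustrative PyTorch cell, not the assignment's `ConvLSTMCell`: it fuses all four gates into a single `Conv1d` (the project uses a separate conv per gate), and the class name is an assumption.

```python
import torch
import torch.nn as nn

class ConvLSTMCellSketch(nn.Module):
    """Minimal ConvLSTM cell: the linear gate maps of a plain LSTM are
    replaced by convolutions, so hidden and cell states keep a spatial axis."""

    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2  # padding preserves the spatial dimension
        # one conv producing all four gates at once
        self.gates = nn.Conv1d(in_ch + hid_ch, 4 * hid_ch, kernel_size, padding=pad)

    def forward(self, x, state):
        h_prev, c_prev = state
        z = self.gates(torch.cat([x, h_prev], dim=1))
        f, i, g, o = torch.chunk(z, 4, dim=1)  # forget, input, candidate, output
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
```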
### 3. Action Classifier
A complete action recognition model with three main components:
#### Encoder

- **Option 1: Custom CNN encoder**
  - 5 convolutional blocks with BatchNorm and GELU activation
  - Progressive channel expansion: 1 → 16 → 32 → 64 → 128 → emb_dim
  - Adaptive average pooling to a fixed size
- **Option 2: Pretrained ResNet18 encoder**
  - Modified first layer for grayscale input (1 channel)
  - Feature extraction with projection to the embedding dimension
#### Recurrent Module

Supports multiple RNN architectures:

- LSTMCell: PyTorch's built-in LSTM cell
- GRUCell: PyTorch's built-in GRU cell
- OwnLSTM: custom LSTM implementation
- OwnConvLSTM: custom ConvLSTM implementation
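Selecting the cell by name could look like the small factory below. The function name and dictionary layout are assumptions; only the two built-in PyTorch cells are instantiated here, since the custom classes live in `src/models.py`.

```python
import torch.nn as nn

def make_rnn_cell(name, in_dim, hid_dim):
    """Map an experiment's RNN-type string to a recurrent cell (sketch)."""
    cells = {
        'LSTMCell': lambda: nn.LSTMCell(in_dim, hid_dim),
        'GRUCell': lambda: nn.GRUCell(in_dim, hid_dim),
        # 'OwnLSTM' / 'OwnConvLSTM' would map to the custom classes
        # in src/models.py
    }
    return cells[name]()
```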
#### Classifier
- Conv1d layer for temporal feature extraction
- Adaptive average pooling
- Fully connected layer for final classification (6 classes)
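Putting the three stages together, the per-frame encode → recur → temporal-conv → classify flow can be sketched as follows. This is a simplified single-layer stand-in (class name, one-conv encoder, and layer widths are illustrative, not the assignment's exact code).

```python
import torch
import torch.nn as nn

class ActionClassifierSketch(nn.Module):
    """Illustrative end-to-end flow: CNN encoder per frame, LSTM cell over
    time, Conv1d + pooling + FC over the hidden-state sequence."""

    def __init__(self, emb_dim=128, hid_dim=128, n_classes=6):
        super().__init__()
        self.encoder = nn.Sequential(            # stands in for the CNN encoder
            nn.Conv2d(1, emb_dim, 3, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.rnn = nn.LSTMCell(emb_dim, hid_dim)
        self.temporal = nn.Conv1d(hid_dim, hid_dim, kernel_size=3, padding=1)
        self.head = nn.Linear(hid_dim, n_classes)

    def forward(self, x):                        # x: (batch, T, 1, H, W)
        b, t = x.shape[:2]
        h = x.new_zeros(b, self.rnn.hidden_size)
        c = x.new_zeros(b, self.rnn.hidden_size)
        outs = []
        for step in range(t):                    # encode each frame, then recur
            feat = self.encoder(x[:, step])
            h, c = self.rnn(feat, (h, c))
            outs.append(h)
        seq = torch.stack(outs, dim=2)           # (batch, hid_dim, T)
        pooled = self.temporal(seq).mean(dim=2)  # temporal conv + average pool
        return self.head(pooled)                 # (batch, n_classes)
```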
## Experiments
The project includes multiple experiments comparing different RNN architectures:
| Experiment | RNN Type | Pretrained Encoder | Scheduler | Description |
|---|---|---|---|---|
| LSTMCell | PyTorch LSTM | ❌ | ✅ | Baseline with PyTorch LSTM |
| LSTMCell_NoScheduler | PyTorch LSTM | ❌ | ❌ | LSTM without learning rate scheduling |
| GRUCell | PyTorch GRU | ❌ | ✅ | GRU-based model |
| GRUCell_NoScheduler | PyTorch GRU | ❌ | ❌ | GRU without scheduling |
| OwnLSTM | Custom LSTM | ❌ | ✅ | Custom LSTM implementation |
| LSTMCell_PretEncoder | PyTorch LSTM | ✅ | ❌ | LSTM with pretrained ResNet encoder |
| LSTMCell_PretEncoder_Scheduler | PyTorch LSTM | ✅ | ✅ | LSTM with pretrained encoder + scheduler |
### Training Configuration

All experiments use:

- Optimizer: Adam
- Learning rate: 0.001 (with optional scheduler)
- Batch size: 32
- Epochs: 50-100 (varies by experiment)
- Loss function: CrossEntropyLoss
- Embedding dimension: 128
- Hidden dimension: 128
- Number of layers: 2
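In PyTorch, that configuration boils down to a few lines of setup. The stand-in model and the choice of `StepLR` (with its `step_size`/`gamma` values) are assumptions; the README only says a scheduler is optional.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model; the real one is the ActionClassifier.
model = nn.Linear(128, 6)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# The scheduler type is an assumption -- any LR scheduler fits the
# "optional scheduler" slot in the experiments table.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
```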
## Key Features

### Data Augmentation

**Spatial augmentations:**

- Random horizontal flip (p=0.5)
- Random rotation (±25 degrees)

**Temporal augmentations:**

- Random temporal sampling (slicing step)
- Random temporal reversal (p=0.3)
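The two temporal augmentations can be sketched in a few lines of NumPy. The function name and the `max_step` default are illustrative; only the p=0.3 reversal probability comes from the list above.

```python
import numpy as np

def temporal_augment(frames, rng, max_step=2, p_reverse=0.3):
    """Random temporal slicing and reversal (sketch).

    frames: array of shape (T, H, W).
    """
    step = rng.integers(1, max_step + 1)  # random temporal sampling step
    frames = frames[::step]
    if rng.random() < p_reverse:          # random temporal reversal
        frames = frames[::-1]
    return frames
```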
### Training Infrastructure
- TensorBoard logging: Training/validation loss, accuracy, and learning rate curves
- Model checkpointing: Saves best models with training configurations
- Progress tracking: Real-time training progress with tqdm
- Evaluation metrics: Accuracy, per-class performance
- Experiment management: YAML configuration files for each experiment
### Custom Utilities
- Seed management: Reproducible experiments
- Model evaluation: Comprehensive evaluation functions
- Visualization: Sequence visualization tools
- Data loading: Efficient dataset handling with proper train/test splits
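A typical seed-management helper for reproducible experiments looks like the sketch below. The function name is an assumption, and the PyTorch lines are left as comments to keep the sketch dependency-light; the real utility would enable them.

```python
import os
import random

import numpy as np

def seed_everything(seed: int = 42):
    """Seed Python, NumPy (and, in the real project, PyTorch) RNGs."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    # torch.manual_seed(seed)
    # torch.cuda.manual_seed_all(seed)
```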
## Project Structure

```
Assignment3/
├── Assignment3.ipynb        # Main assignment notebook
├── session3.ipynb           # Lab session materials
├── src/
│   ├── models.py            # Custom LSTM and ConvLSTM implementations
│   ├── dataloader.py        # KTHActionDataset class
│   ├── transformations.py   # Data augmentation transforms
│   ├── utils.py             # Training and evaluation utilities
│   └── devel/
│       ├── task1.ipynb      # Task 1 development notebook
│       ├── task2.ipynb      # Task 2 development notebook
│       └── task3.ipynb      # Task 3 (extra credit) notebook
├── data/
│   └── README.md            # Dataset information
├── models/
│   └── README.md            # Model checkpoints directory
├── tboard_logs/             # TensorBoard logs for all experiments
│   ├── LSTMCell/
│   ├── GRUCell/
│   ├── OwnLSTM/
│   └── ...
└── imgs/                    # Visualization images and GIFs
    ├── pipeline.png
    ├── gif_*.gif
    └── ...
```
## Analysis & Results

### Model Comparison

The notebook includes comprehensive analysis:

- Learning curves: training vs. validation loss and accuracy over epochs
- Performance metrics: overall and per-class accuracy
- Parameter count: comparison of model sizes
- Training/inference time: efficiency analysis
- Failure case analysis: visualization of misclassified sequences
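The parameter-count comparison reduces to a one-line sum over the model's trainable tensors; the helper name is an assumption.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total trainable parameter count, as used in a model-size comparison."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

For example, a `Linear(10, 5)` layer has 10 × 5 weights plus 5 biases, i.e. 55 parameters.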
### Key Findings
- GRU Performance: GRUCell achieved the best performance on the dataset
- LSTM vs GRU: GRU's simpler architecture (no cell state) can be more efficient while maintaining performance
- Custom Implementation: OwnLSTM showed competitive results, validating the implementation
- Pretrained Encoders: Using pretrained ResNet encoders improved feature extraction
- Learning Rate Scheduling: Schedulers helped stabilize training and improve convergence
- Temporal Augmentations: Effective for improving generalization
## Usage

### Running the Notebook
1. Install dependencies.
2. Download the KTH-Actions dataset:
   - The dataset should be placed in the appropriate directory, or modify the `root_dir` parameter in `KTHActionDataset`.
3. Open the notebook.
4. Run experiments: execute the cells sequentially to:
   - Implement custom LSTM and ConvLSTM cells (Task 1)
   - Load and preprocess the KTH-Actions dataset
   - Train different RNN architectures (Task 2)
   - Evaluate and compare models
   - Visualize results
### Viewing TensorBoard Logs

Launch TensorBoard with `tensorboard --logdir tboard_logs/`, then open http://localhost:6006 in your browser to view the training curves for all experiments.
### Loading Saved Models

```python
import torch

checkpoint = torch.load('models/experiment_name/checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
```
### Using Custom Models

```python
from src.models import OwnLSTM, ConvLSTMCell
from src.dataloader import KTHActionDataset
from src.transformations import get_train_transforms, get_test_transforms

# Initialize the custom LSTM
lstm = OwnLSTM(input_size=128, hidden_size=128, output_size=128)

# Load the dataset
train_dataset = KTHActionDataset(
    root_dir='path/to/kth_actions',
    split='train',
    transform=get_train_transforms(slicing_step=2),
    max_frames=10,
    img_size=(64, 64),
)
```
## Extra Credit: 3D-CNN Implementation
The project includes an implementation of R(2+1)d-Net for action recognition:
- Architecture: Factorized 3D convolutions (2D spatial + 1D temporal)
- Advantages: More efficient than full 3D convolutions while maintaining performance
- Comparison: Evaluated against RNN-based models
See src/devel/task3.ipynb for implementation details.
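The factorization can be illustrated with a single (2+1)D block: a spatial 3×3 conv over H and W, followed by a temporal 3-tap conv over T. This is a simplified sketch (class name, mid-channel width, and layer layout are assumptions, not the notebook's exact code).

```python
import torch
import torch.nn as nn

class R2Plus1dBlock(nn.Module):
    """Factorized 3D convolution: 2D spatial conv followed by 1D temporal conv."""

    def __init__(self, in_ch, out_ch, mid_ch=None):
        super().__init__()
        mid_ch = mid_ch or out_ch
        # Kernel (1, 3, 3): convolve over H and W only.
        self.spatial = nn.Conv3d(in_ch, mid_ch, (1, 3, 3), padding=(0, 1, 1))
        # Kernel (3, 1, 1): convolve over time only.
        self.temporal = nn.Conv3d(mid_ch, out_ch, (3, 1, 1), padding=(1, 0, 0))
        self.act = nn.ReLU()

    def forward(self, x):  # x: (batch, C, T, H, W)
        return self.act(self.temporal(self.act(self.spatial(x))))
```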
## References
- KTH-Actions Dataset
- Understanding LSTMs
- Convolutional LSTM Network
- R(2+1)D Networks
- PyTorch Documentation
- TensorBoard
Date: 18.05.2025
## Support
If you found this project helpful, you can support my work by buying me a coffee or via PayPal!
## Location
The complete assignment documentation, code, and notebooks are located in:
This assignment demonstrates deep understanding of recurrent neural networks, including custom implementations of LSTM and ConvLSTM cells, and their application to video action recognition tasks.