This repository implements a comprehensive framework for Group Activity Recognition (GAR) in volleyball videos, based on the research paper:
A Hierarchical Deep Temporal Model for Group Activity Recognition
Mostafa S. Ibrahim, Srikanth Muralidharan, Zhiwei Deng, Arash Vahdat, Greg Mori. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
This project focuses on recognizing complex group activities in volleyball games by analyzing the temporal dynamics and spatial relationships between multiple players. The system can identify various volleyball-specific group activities such as right/left team spikes, sets, passes, and winning points.
Group Activity Recognition in sports videos is challenging because it requires:
- Understanding individual player actions
- Modeling temporal dependencies across frames
- Capturing spatial relationships between multiple players
- Recognizing coordinated team activities
The implementation includes 8 different baseline models that progressively increase in complexity:
| Model | Description | Architecture | Test Accuracy |
|---|---|---|---|
| Baseline 1 | Simple ResNet-50 | Single CNN for frame-level classification | 74.83% |
| Baseline 3A | Feature extraction model | ResNet-based feature extractor for individual actions | 78.27% |
| Baseline 3B | Enhanced feature model | Improved version of 3A with better feature representation | 82.12% |
| Baseline 4 | Temporal modeling | LSTM-based temporal sequence modeling | 81.08% |
| Baseline 5 | Multi-stream approach | Multiple input streams for different modalities | 83.40% |
| Baseline 6 | Attention mechanism | Attention-based temporal modeling | 77.86% |
| Baseline 7 | Hierarchical modeling | Multi-level temporal and spatial modeling | 86.46% |
| Baseline 8 | Advanced LSTM | Dual LSTM with team-based aggregation | 89.08% |
- Feature Extraction: Uses pre-trained ResNet models to extract 2048-dimensional features from player bounding boxes
- Temporal Modeling: LSTM networks to capture sequential dependencies
- Spatial Aggregation: Team-based feature aggregation for group activity recognition
- Multi-level Classification: Hierarchical approach from individual actions to group activities
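The components listed above can be combined in several ways. The following is a minimal PyTorch sketch of one plausible arrangement: a person-level LSTM, max pooling per team, and a group-level LSTM. The class name `TeamPooledLSTM`, the hidden size, and the assumption that players are ordered left team first are all illustrative; the repository's Baseline 8 may be structured differently.

```python
import torch
from torch import nn


class TeamPooledLSTM(nn.Module):
    """Hypothetical sketch: person LSTM -> per-team max pool -> group LSTM."""

    def __init__(self, feat_dim=2048, hidden=256, num_classes=8):
        super().__init__()
        self.person_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.group_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # x: (batch, players, frames, feat_dim).
        # Assumption: the first half of the players is the left team,
        # the second half is the right team.
        b, p, t, d = x.shape
        person_out, _ = self.person_lstm(x.reshape(b * p, t, d))
        person_out = person_out.reshape(b, p, t, -1)

        # Max-pool each team's players into one vector per frame.
        left = person_out[:, : p // 2].max(dim=1).values
        right = person_out[:, p // 2 :].max(dim=1).values
        scene = torch.cat([left, right], dim=-1)  # (batch, frames, 2 * hidden)

        # The group-level LSTM runs over frames; classify from the last step.
        group_out, _ = self.group_lstm(scene)
        return self.classifier(group_out[:, -1])


# Small dimensions keep the example quick to run.
model = TeamPooledLSTM(feat_dim=32, hidden=16, num_classes=8)
logits = model(torch.randn(2, 12, 9, 32))  # 2 clips, 12 players, 9 frames
```

Pooling each team separately before concatenation keeps left-side and right-side evidence apart, which matters because the group labels distinguish `l_` from `r_` activities.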
The system works with volleyball video datasets containing:
- Individual Actions: 9 action classes (waiting, setting, digging, falling, spiking, blocking, jumping, moving, standing)
- Group Activities: 8 group activity classes (r_spike, r_set, r-pass, r_winpoint, l-spike, l_set, l-pass, l_winpoint)
- Player Tracking: Bounding box annotations for each player across video frames
- Temporal Segmentation: Video clips annotated with group activity labels
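As an illustration of how these label sets might be turned into integer class indices, here is a small sketch. The class names are copied from the lists above (including the mixed `_` and `-` spellings); the variable names and the `encode_clip` helper are hypothetical and not taken from `constants.py`.

```python
# Class names as listed above; the index order is an assumption.
INDIVIDUAL_ACTIONS = [
    "waiting", "setting", "digging", "falling", "spiking",
    "blocking", "jumping", "moving", "standing",
]
GROUP_ACTIVITIES = [
    "r_spike", "r_set", "r-pass", "r_winpoint",
    "l-spike", "l_set", "l-pass", "l_winpoint",
]

ACTION_TO_INDEX = {name: i for i, name in enumerate(INDIVIDUAL_ACTIONS)}
ACTIVITY_TO_INDEX = {name: i for i, name in enumerate(GROUP_ACTIVITIES)}


def encode_clip(activity, player_actions):
    """Map one clip's group label and per-player action labels to indices."""
    return (
        ACTIVITY_TO_INDEX[activity],
        [ACTION_TO_INDEX[action] for action in player_actions],
    )


encoded = encode_clip("r_set", ["setting", "standing"])
```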
- Python 3.13.7+
- uv (Python package manager)
- CUDA (recommended for training)
- Clone the repository:

      git clone <repository-url>
      cd volleyball

- Install uv (if not already installed):

      curl -LsSf https://astral.sh/uv/install.sh | sh

- Install dependencies using uv:

      uv sync

This will automatically:
- Create a virtual environment
- Install all dependencies from pyproject.toml
- Set up the project for development
The project uses modern Python packaging with pyproject.toml:
- Dependencies: PyTorch, torchvision, Pillow, matplotlib, scikit-learn
- Python Version: 3.13.7+
- Code Quality: Ruff for linting and formatting
- Package Manager: uv for fast dependency resolution
    volleyball/
    ├── models/ # Model architectures (8 baselines)
    ├── datasets/ # Dataset loaders and utilities
    ├── trainers/ # Training scripts for each baseline
    ├── utils/ # Utility functions and helpers
    ├── extract_features.py # Feature extraction pipeline
    ├── constants.py # Configuration constants
    ├── trained_models/ # Pre-trained model weights
    ├── confusion_matrix/ # Confusion matrix visualizations
    ├── loss_accuracy/ # Training curves
    └── logs/ # Training logs
Extract deep features from video frames using pre-trained models:

    uv run python extract_features.py

This script:
- Loads video frames and player bounding boxes
- Crops individual player regions
- Extracts 2048-dimensional features using the Baseline 3A model
- Saves features for training/validation/test splits
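The cropping step needs valid crop coordinates, since annotated boxes can extend past the frame edge. The helper below is a minimal sketch of that clamping. It assumes boxes are given as `(x1, y1, x2, y2)` pixel corners; the function name `clamp_box` and this box format are assumptions, and the dataset's annotation format may differ.

```python
def clamp_box(box, frame_width, frame_height):
    """Clamp an (x1, y1, x2, y2) box to the frame and keep it non-empty."""
    x1, y1, x2, y2 = box
    # The top-left corner must be a valid pixel index.
    x1 = max(0, min(int(x1), frame_width - 1))
    y1 = max(0, min(int(y1), frame_height - 1))
    # The bottom-right corner is exclusive and at least one pixel past it.
    x2 = max(x1 + 1, min(int(x2), frame_width))
    y2 = max(y1 + 1, min(int(y2), frame_height))
    return x1, y1, x2, y2


# A box that runs off the top-left and bottom edges of a 1280x720 frame.
crop = clamp_box((-10, 20, 100, 800), 1280, 720)
```

The returned tuple can be passed directly to an image-cropping call such as Pillow's `Image.crop`, which takes a `(left, upper, right, lower)` box.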
Train any of the 8 baseline models:

    # Train Baseline 1 (ResNet-50)
    uv run python trainers/train_b1.py

    # Train Baseline 8 (Advanced LSTM)
    uv run python trainers/train_b8.py

Models are automatically evaluated on test sets during training, generating:
- Confusion matrices for training and validation
- Loss and accuracy curves
- F1 scores and classification metrics
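To make the reported metrics concrete, the sketch below computes a confusion matrix, accuracy, and macro-averaged F1 in plain Python. It is an illustration of the definitions only; the training scripts may compute these differently, for example with scikit-learn, which is listed among the dependencies. All function names here are illustrative.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def macro_f1(matrix):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    n = len(matrix)
    scores = []
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp
        fn = sum(matrix[c]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Toy labels for a 3-class problem.
cm = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
acc = accuracy(cm)
f1 = macro_f1(cm)
```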
The repository includes comprehensive evaluation results with Baseline 8 achieving the highest test accuracy of 89.08%:
- Best Performing Model: Baseline 8 (Advanced LSTM) - 89.08% accuracy
- Strong Temporal Models: Baseline 7 (86.46%) and Baseline 5 (83.40%) show the importance of temporal modeling
- Feature Quality Impact: Baseline 3B (82.12%) outperforms 3A (78.27%), demonstrating improved feature representation
- Baseline Comparison: Simple ResNet-50 (Baseline 1) achieves 74.83%, providing a solid foundation
- Temporal Modeling is Critical: Models with LSTM components (B4, B5, B7, B8) generally outperform frame-based approaches, although Baseline 4 (81.08%) trails Baseline 3B (82.12%)
- Team-based Aggregation Works: Baseline 8's dual LSTM with team-based feature aggregation achieves the best results
- Hierarchical Approaches Excel: Multi-level modeling (B7, B8) captures both individual and group dynamics effectively
- Feature Quality Matters: Enhanced feature extraction (B3B vs B3A) provides measurable improvements
- Training Curves: Loss and accuracy plots for all baselines
- Confusion Matrices: Detailed classification performance analysis
- Model Weights: Pre-trained models for immediate use
- Logs: Detailed training logs for reproducibility
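The comparisons above can be checked directly against the reported numbers. The short sketch below ranks the baselines and computes each model's gain over Baseline 1 in percentage points. The accuracies are copied from the results table; the helper name `gains_over` is illustrative.

```python
# Test accuracies (%) copied from the results table in this README.
TEST_ACCURACY = {
    "Baseline 1": 74.83,
    "Baseline 3A": 78.27,
    "Baseline 3B": 82.12,
    "Baseline 4": 81.08,
    "Baseline 5": 83.40,
    "Baseline 6": 77.86,
    "Baseline 7": 86.46,
    "Baseline 8": 89.08,
}


def gains_over(reference, accuracies):
    """Percentage-point difference of every other model versus the reference."""
    base = accuracies[reference]
    return {
        name: round(acc - base, 2)
        for name, acc in accuracies.items()
        if name != reference
    }


ranking = sorted(TEST_ACCURACY, key=TEST_ACCURACY.get, reverse=True)
gains = gains_over("Baseline 1", TEST_ACCURACY)
```

In this ranking Baseline 3B sits above Baseline 4, and Baseline 6 sits below Baseline 3A, so accuracy does not increase strictly with the baseline number.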
This implementation provides:
- Comprehensive Baselines: 8 different approaches to GAR
- Volleyball-Specific Modeling: Domain-adapted for sports analysis
- Temporal Dynamics: Advanced LSTM-based sequence modeling
- Spatial Relationships: Team-based feature aggregation
- Reproducible Results: Complete training and evaluation pipeline
Common uv commands for development:

    # Activate the virtual environment created by uv sync (Linux/macOS)
    source .venv/bin/activate

    # Run scripts in the virtual environment
    uv run python script.py

    # Add new dependencies
    uv add package-name

    # Sync the environment with pyproject.toml and the lockfile
    uv sync

    # Run linting and formatting
    uv run ruff check .
    uv run ruff format .

To add a new model:
- Create a new model class in models/
- Implement the forward pass
- Create a training script in trainers/
- Update constants if needed
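As a rough illustration of adding a model, the sketch below defines a small model class with a forward pass and a single training step. The class `MeanPoolBaseline`, the function `train_step`, and the input layout `(batch, players, feat_dim)` are hypothetical and do not mirror the existing files in `models/` or `trainers/`.

```python
import torch
from torch import nn


class MeanPoolBaseline(nn.Module):
    """Hypothetical model: average the player features, then classify."""

    def __init__(self, feat_dim=2048, num_classes=8):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        # x: (batch, players, feat_dim) -> logits of shape (batch, num_classes)
        return self.head(x.mean(dim=1))


def train_step(model, optimizer, features, labels):
    """Run one optimization step and return the scalar loss."""
    model.train()
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(features), labels)
    loss.backward()
    optimizer.step()
    return loss.item()


# Random tensors stand in for extracted features and group labels.
model = MeanPoolBaseline(feat_dim=16, num_classes=8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_value = train_step(
    model, optimizer, torch.randn(4, 12, 16), torch.tensor([0, 1, 2, 3])
)
```

A real training script would wrap `train_step` in a loop over a `DataLoader` and add the evaluation and logging described earlier.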
Edit constants.py to:
- Add new individual actions
- Define new group activities
- Adjust feature dimensions
Contributions are welcome! Please feel free to:
- Report bugs and issues
- Suggest new model architectures
- Improve documentation
- Add new evaluation metrics
This project is for research purposes. Please cite the original paper if you use this implementation in your research.
- Original authors for the foundational research
- PyTorch community for the deep learning framework
- Computer vision research community for open-source tools and datasets
Note: This implementation focuses on volleyball group activity recognition and serves as a comprehensive baseline for sports video analysis research.