This project explores the application of Natural Language Processing (NLP) and machine learning techniques to predict movie revenues using various features including reviews, crew information, and genres. The project progresses from a baseline model to an advanced neural network utilizing BERT embeddings.
The project implements four increasingly complex models to predict movie revenues:
- Baseline Model (Average Revenue Prediction)
- Linear Regression with TF-IDF
- Basic Neural Network
- Advanced Neural Network with BERT Embeddings
Each successive model lowers the prediction error of the one before it. The final BERT-based model performs best, with a mean absolute error of about $73.5M compared with about $121.4M for the baseline.
The project combines two Kaggle datasets into a single dataset called filtered_with_review, which contains the following fields:
- Movie names
- Release dates
- Scores
- Genres
- Reviews
- Crew information
- Budget
- Revenue
- Country
- Language
- Status
Model | MAE ($) | MSE ($²) | RMSE ($) |
---|---|---|---|
Baseline | 121,433,652.96 | 23,676,641,432,614,836.00 | 153,872,159.38 |
Linear Regression | 92,968,624.58 | 16,097,876,478,672,158.00 | 126,877,407.28 |
Simple Neural Network | 81,014,284.61 | 13,285,544,772,123,966.00 | 115,262,937.55 |
Advanced Neural Network | 73,535,891.10 | 11,611,688,000,320,616.00 | 107,757,542.66 |
Overall improvements from baseline to final model:
- 39.44% reduction in Mean Absolute Error
- 50.96% reduction in Mean Squared Error
- 29.97% reduction in Root Mean Squared Error
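
The percentages above follow directly from the results table. A minimal sketch of the arithmetic, using the baseline and Advanced Neural Network rows as inputs (the helper name `percent_reduction` is illustrative, not taken from the project code):

```python
# Error metrics copied from the results table (baseline vs. final model).
baseline = {"MAE": 121_433_652.96, "MSE": 23_676_641_432_614_836.00, "RMSE": 153_872_159.38}
final = {"MAE": 73_535_891.10, "MSE": 11_611_688_000_320_616.00, "RMSE": 107_757_542.66}


def percent_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, expressed as a percentage."""
    return (before - after) / before * 100.0


reductions = {
    name: round(percent_reduction(baseline[name], final[name]), 2)
    for name in baseline
}
print(reductions)
```

MSE falls by a larger percentage than RMSE because it is the square of RMSE, so the same improvement is amplified.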
Baseline Model:
- Simple average revenue prediction
- Used as a benchmark for comparison
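
As a rough illustration of the baseline, the sketch below always predicts the mean training revenue and scores it with the three metrics reported in the results table. The revenue figures and the `evaluate` helper are hypothetical and are not taken from the project.

```python
import numpy as np


def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Return MAE, MSE and RMSE for a set of predictions."""
    errors = y_true - y_pred
    mse = float(np.mean(errors ** 2))
    return {
        "MAE": float(np.mean(np.abs(errors))),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
    }


# Hypothetical revenues in dollars; the real values come from the dataset.
train_revenue = np.array([10e6, 50e6, 120e6, 300e6])
test_revenue = np.array([20e6, 200e6])

# The baseline ignores every feature and always predicts the training mean.
mean_revenue = float(train_revenue.mean())
baseline_pred = np.full_like(test_revenue, mean_revenue)

baseline_metrics = evaluate(test_revenue, baseline_pred)
print(baseline_metrics)
```

Any model that cannot beat these numbers is not learning anything useful from the features, which is why the baseline serves as the benchmark.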
Linear Regression with TF-IDF:
- TF-IDF vectorization for text features
- One-hot encoding for categorical features
- Power transformation for numerical features
- Date component extraction
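
One way the preprocessing steps for the linear regression model could fit together is a scikit-learn `ColumnTransformer` feeding a `LinearRegression`. This is a sketch under assumptions: the column names, the toy rows, and the choice to express budget in millions of dollars are all hypothetical, and the project's actual pipeline may differ.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer

# Hypothetical toy rows; column names are assumptions, not the dataset's schema.
df = pd.DataFrame({
    "review": ["great action film", "slow but moving drama", "funny family comedy",
               "dark gritty thriller", "epic space adventure", "quiet indie drama"],
    "genre": ["Action", "Drama", "Comedy", "Thriller", "Action", "Drama"],
    "budget_musd": [150.0, 20.0, 60.0, 40.0, 200.0, 5.0],  # millions of dollars
    "release_date": ["2019-05-03", "2017-11-10", "2018-07-20",
                     "2016-02-12", "2021-12-17", "2015-09-25"],
    "revenue": [600e6, 45e6, 210e6, 90e6, 850e6, 8e6],
})

# Date component extraction: split the release date into numeric parts.
dates = pd.to_datetime(df["release_date"])
df["release_year"] = dates.dt.year
df["release_month"] = dates.dt.month

preprocess = ColumnTransformer([
    ("text", TfidfVectorizer(), "review"),  # a single column name yields 1-D text input
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["genre"]),
    ("num", PowerTransformer(), ["budget_musd"]),  # Yeo-Johnson by default
    ("date", "passthrough", ["release_year", "release_month"]),
])

model = Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])

features = df.drop(columns=["revenue", "release_date"])
model.fit(features, df["revenue"])
predictions = model.predict(features)
print(predictions.shape)
```

Wrapping the transformers in a single pipeline keeps the fitted TF-IDF vocabulary and power-transform parameters tied to the training data, so the same transformations are reused at prediction time.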
Basic Neural Network:
- Two hidden layers (128 and 64 neurons)
- Batch normalization and dropout
- ReLU activation
- Xavier initialization
- Adam optimizer with learning rate scheduling
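
The basic network described above might look like the following in PyTorch. The framework choice, the dropout rate, the `StepLR` scheduler, the input width, and the random data are assumptions made for illustration; the source only states that learning rate scheduling is used, not which scheduler.

```python
import torch
from torch import nn


class BasicRevenueNet(nn.Module):
    """Two hidden layers (128, 64) with batch norm, ReLU and dropout."""

    def __init__(self, n_features: int, dropout: float = 0.3):  # dropout rate assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, 1),
        )
        for module in self.net:
            if isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight)  # Xavier (Glorot) initialization
                nn.init.zeros_(module.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


torch.manual_seed(0)
X = torch.randn(32, 20)  # hypothetical batch: 32 movies, 20 features
y = torch.randn(32)      # hypothetical scaled revenue targets

model = BasicRevenueNet(n_features=20)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Assumed schedule: halve the learning rate every 2 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)
loss_fn = nn.MSELoss()

losses = []
model.train()
for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
    losses.append(loss.item())

final_lr = optimizer.param_groups[0]["lr"]
```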
Advanced Neural Network with BERT Embeddings:
- Five fully connected layers (1024 → 512 → 256 → 128 → 1)
- BERT embeddings for text processing
- Layer normalization
- GELU activation
- Orthogonal initialization
- Hybrid loss function (MSE + Huber)
- Cosine annealing scheduler
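
A minimal PyTorch sketch of the advanced architecture follows. It reads the layer list as input → 1024 → 512 → 256 → 128 → 1, which gives five linear layers. Several details are assumptions: random tensors stand in for precomputed BERT embeddings (768 is the hidden size of bert-base), the 16 extra features are hypothetical, and the equal weighting of MSE and Huber in `hybrid_loss`, the Adam optimizer, and `T_max=10` are illustrative choices not stated in the source.

```python
import torch
from torch import nn


class AdvancedRevenueNet(nn.Module):
    """Input -> 1024 -> 512 -> 256 -> 128 -> 1 with LayerNorm and GELU."""

    def __init__(self, n_features: int):
        super().__init__()
        widths = [n_features, 1024, 512, 256, 128]
        layers = []
        for in_dim, out_dim in zip(widths[:-1], widths[1:]):
            layers += [nn.Linear(in_dim, out_dim), nn.LayerNorm(out_dim), nn.GELU()]
        layers.append(nn.Linear(128, 1))  # final regression output
        self.net = nn.Sequential(*layers)
        for module in self.net:
            if isinstance(module, nn.Linear):
                nn.init.orthogonal_(module.weight)  # orthogonal initialization
                nn.init.zeros_(module.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


def hybrid_loss(pred, target, alpha: float = 0.5, delta: float = 1.0):
    """Weighted sum of MSE and Huber loss; alpha and delta are assumed values."""
    mse = nn.functional.mse_loss(pred, target)
    huber = nn.functional.huber_loss(pred, target, delta=delta)
    return alpha * mse + (1.0 - alpha) * huber


torch.manual_seed(0)
bert_embeddings = torch.randn(8, 768)  # stand-in for real BERT review embeddings
other_features = torch.randn(8, 16)    # hypothetical non-text features
X = torch.cat([bert_embeddings, other_features], dim=1)
y = torch.randn(8)                     # hypothetical scaled revenue targets

model = AdvancedRevenueNet(n_features=X.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

for epoch in range(10):
    optimizer.zero_grad()
    loss = hybrid_loss(model(X), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # learning rate follows a cosine curve down toward zero

final_lr = optimizer.param_groups[0]["lr"]
```

Combining MSE with Huber keeps the strong gradient signal of squared error on typical movies while limiting the influence of extreme blockbuster revenues, which is a plausible motivation for a hybrid loss on heavy-tailed targets.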
# Core requirements
numpy
pandas
scikit-learn
tensorflow
torch
transformers
Potential improvements and extensions:
- Incorporate temporal trends and seasonal patterns
- Explore alternative transformer-based language models
- Develop ensemble methods combining multiple architectures
- Add more features related to movie marketing and distribution