Movie Revenue Prediction

Introduction

In the contemporary film industry, accurately predicting a movie's earnings is paramount for maximizing profitability. This project aims to develop a machine learning model that forecasts movie earnings from a comprehensive set of input features: the movie name, MPAA rating, genre, year of release, IMDb rating, number of viewer votes, director, writer, leading cast, country of production, budget, production company, and runtime.

Numerous factors influence a movie's earnings, and the optimal combination of these factors remains elusive. Our machine learning model seeks to uncover the most significant factors for box office success by analyzing real data from a diverse array of movies produced globally.

We hypothesize that certain parameters, such as the director's track record and the genre of the film, hold more significance in predicting movie revenue than others. Observations suggest that despite lower IMDb ratings, action-oriented films often perform well at the box office, while genres such as comedy or emotional dramas, despite potentially higher IMDb ratings, may not achieve comparable revenue outcomes. These insights highlight the complex interplay between film attributes and audience preferences.

By leveraging these diverse attributes, our goal is to construct a robust predictive model that can offer valuable insights and aid decision-making in the film industry. Ultimately, this will help filmmakers optimize their movie production strategies for maximum profit and popularity.

Directory Structure

Movie-Revenue-Prediction/
│
├── fig
│   └── intro.png
│
├── Helper Files
│   ├── Best Features
│   │   ├── feature_scores.py
│   │   ├── feature_scores.txt
│   │   ├── significant_features.py
│   │   └── significant_features.txt
│   ├── budgetxgross.py
│   ├── data_visualization.py
│   ├── gross_histogram.py
│   ├── null_values_check.py
│   └── pie_chart.py
│
├── Misc
│   └── initial_try.py
│
├── models
│   ├── accuracies.txt
│   ├── decision_tree_bagging.py
│   ├── decision_tree.py
│   ├── feature_scaling.py
│   ├── gradient_boost.py
│   ├── linear_regression_pca.py
│   ├── linear_regression.py
│   ├── random_forest.py
│   ├── tracking_XGBoost.py
│   └── XGBoost.py
│
├── old datasets
│   ├── finalised dataset
│   │   ├── dataset_modified.py
│   │   ├── masti.csv
│   │   ├── new_updated_less-than-1b-dataset.csv
│   │   ├── new_updated_less-than-350m-dataset.csv
│   │   ├── old_data.csv
│   │   └── updated_masti.csv
│   ├── initial
│   │   ├── initial_dataset.csv
│   │   └── initial_merge.csv
│   ├── Intermediate
│   │   ├── intermediate_dataset.csv
│   │   ├── intermediate_merge.csv
│   │   └── intermediate1_dataset.csv
│   ├── Kaggle
│   │   ├── IMDb 5000+.csv
│   │   ├── movie_data_imdb.csv
│   │   ├── movie_metadata.csv
│   │   └── top_500_movies.csv
│   │
│   ├── data_builder_check.py
│   ├── dataset.csv
│   ├── dataset2.csv
│   ├── final_dataset.csv
│   ├── final_merge.csv
│   └── README.md
│
├── Reports
│   ├── Proposal
│   │   ├── proposal.pdf
│   │   └── proposal.tex
│   ├── 1st Project Report
│   │   ├── 1st_Project_Report.pdf
│   │   ├── 1st_Project_Report.tex
│   │   └── pics
│   │       ├── gross_histogram.png
│   │       ├── k_best.png
│   │       ├── model_accuracy_plot.png
│   │       └── null_values.png
│   └── Final Report
│       ├── Final_Report.pdf
│       ├── Final_Report.tex
│       └── pics
│           ├── gross_histogram.png
│           ├── k_best.png
│           ├── model_accuracy_plot.png
│           ├── null_values.png
│           ├── pie_chart.png
│           └── R2_score_tracking.png
│
├── revised datasets
│   ├── movies.csv
│   ├── output.csv
│   └── README.md
│
├── .gitignore
├── LICENSE
├── main.py
├── README.md
├── requirements.txt
└── streamlit_app.py

Getting Started

All our code was tested on Python 3.6.8 with scikit-learn 1.3.2. Our scripts ideally run on a single GPU (they call .cuda() for inference), but inference also works on CPUs with minimal changes to the scripts.

Setting up the Environment

We recommend setting up a Python virtual environment and installing all the requirements. Please follow these steps to set up the project folder correctly:

```
git clone https://github.com/Vikranth3140/Movie-Revenue-Prediction.git
cd Movie-Revenue-Prediction

python3 -m venv ./env
source env/bin/activate

pip install -r requirements.txt
```

Setting up Datasets

The datasets are taken from the Movie Industry dataset.
Detailed instructions on how to set up our revised datasets are provided in revised datasets/README.md.

Running the Models

You can run the models using:

```
python <model_name>.py
```

The model_name parameter can be one of [linear_regression, decision_tree, random_forest, decision_tree_bagging, gradient_boost, XGBoost].

Data Preprocessing

We provide scripts for data preprocessing, including handling missing values, encoding categorical variables, feature scaling, and feature engineering.

Handling Missing Values

Missing values are handled using SimpleImputer with a median strategy in the feature_scaling.py script.
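As a minimal sketch of median imputation with scikit-learn's SimpleImputer (the toy columns and values below are assumptions for illustration, not the contents of feature_scaling.py):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with hypothetical columns; the real dataset has more features.
df = pd.DataFrame({
    "budget": [1.0e7, np.nan, 5.0e7, 2.0e7],
    "runtime": [90.0, 120.0, np.nan, 100.0],
})

# Replace each missing value with the median of its own column.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed)
```

The median is less sensitive than the mean to a few very large budgets, which is one plausible reason to prefer it for skewed columns.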

Encoding Categorical Variables

Categorical variables are encoded using Label Encoding. This is implemented in the feature_scaling.py script, which is called before every model is trained.
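A minimal sketch of label encoding with scikit-learn's LabelEncoder follows. The column names and the idea of keeping one encoder per column are assumptions for illustration; the project's script may organise this differently.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical columns (hypothetical values).
df = pd.DataFrame({
    "genre": ["Action", "Comedy", "Drama", "Action"],
    "rating": ["PG-13", "R", "R", "PG"],
})

# Fit one encoder per column and keep it, so the same mapping can be
# reused at prediction time or inverted for display.
encoders = {}
for column in ["genre", "rating"]:
    encoder = LabelEncoder()
    df[column] = encoder.fit_transform(df[column])
    encoders[column] = encoder

print(df)
```

LabelEncoder assigns integers in sorted order of the category names. Tree-based models tolerate this well, while linear models may read the integers as an ordering.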

Feature Scaling

Enhanced feature preprocessing is implemented in feature_scaling.py:

Log transformation for skewed numerical features, particularly budget and revenue
StandardScaler applied to normalize numerical features

These preprocessing steps have resulted in substantially improved model performance, with significantly lower Mean Absolute Percentage Error (MAPE) and Mean Squared Logarithmic Error (MSLE).
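The two scaling steps can be sketched as follows. This is an illustration under assumptions: the column name gross for revenue, the use of log1p, and the choice of which columns to standardize are not confirmed by the source.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data; "gross" is an assumed name for the revenue column.
df = pd.DataFrame({
    "budget": [1.0e6, 1.0e7, 1.0e8, 2.0e8],
    "gross": [5.0e5, 3.0e7, 2.5e8, 9.0e8],
    "runtime": [88.0, 101.0, 124.0, 143.0],
})

# log1p compresses the long right tail of monetary columns.
skewed = ["budget", "gross"]
df[skewed] = np.log1p(df[skewed])

# Standardize the feature columns to zero mean and unit variance.
features = ["budget", "runtime"]
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df[features]), columns=features)

print(scaled)
```

If the target is log-transformed like this, predictions must be mapped back with np.expm1 before they are reported in dollars.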

Feature Engineering

New features are created in our models:

vote_score_ratio
budget_year_ratio
vote_year_ratio
score_runtime_ratio
budget_per_minute
votes_per_year

Binary features introduced:

is_recent
is_high_budget
is_high_votes
is_high_score

These engineered features capture complex relationships and trends in the data, enhancing our model's ability to discover patterns. The combination of ratio-based and binary features provides a richer representation of the movie attributes, leading to improved predictive performance across our various models.
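The feature names above are listed without their formulas, so the sketch below guesses a plausible definition for each one. Every formula, the REFERENCE_YEAR and RECENT_CUTOFF constants, the median thresholds, and the input column names are assumptions made for illustration only.

```python
import pandas as pd

# Toy input with assumed column names.
df = pd.DataFrame({
    "budget": [1.0e7, 8.0e7, 2.0e8],
    "votes": [20000, 150000, 900000],
    "score": [6.1, 7.4, 8.2],
    "runtime": [95, 118, 150],
    "year": [1995, 2008, 2019],
})

REFERENCE_YEAR = 2020  # hypothetical "current" year
RECENT_CUTOFF = 2010   # hypothetical threshold for is_recent

# Ratio features (assumed definitions).
df["vote_score_ratio"] = df["votes"] / df["score"]
df["budget_year_ratio"] = df["budget"] / df["year"]
df["vote_year_ratio"] = df["votes"] / df["year"]
df["score_runtime_ratio"] = df["score"] / df["runtime"]
df["budget_per_minute"] = df["budget"] / df["runtime"]
df["votes_per_year"] = df["votes"] / (REFERENCE_YEAR - df["year"] + 1)

# Binary features: 1 when the value is above the column median (assumed rule).
df["is_recent"] = (df["year"] >= RECENT_CUTOFF).astype(int)
for source, flag in [
    ("budget", "is_high_budget"),
    ("votes", "is_high_votes"),
    ("score", "is_high_score"),
]:
    df[flag] = (df[source] > df[source].median()).astype(int)

print(df.filter(like="is_"))
```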

Feature Selection

We use SelectKBest to identify the features that contribute most to the target variable, as implemented in the significant_features.py and feature_scores.py scripts.
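A minimal sketch of ranking features with SelectKBest is shown below. The scoring function f_regression, the value k=2, and the synthetic data are assumptions; the project's scripts may use a different score function or k.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: the target depends on "budget" and "votes" only.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "budget": rng.normal(size=n),
    "votes": rng.normal(size=n),
    "runtime": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = 3.0 * X["budget"] + 1.5 * X["votes"] + rng.normal(scale=0.1, size=n)

# Score each feature against the target and keep the top k.
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, y)

scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
selected = list(X.columns[selector.get_support()])

print(scores)
print(selected)
```

f_regression scores each feature independently with a univariate F-test, so it can miss features that only matter in combination with others.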

Model Improvement

We employed strategies such as hyperparameter tuning using GridSearchCV for model improvement.

Hyperparameter Tuning

GridSearchCV evaluates every combination in a parameter grid with cross-validation and keeps the best-scoring configuration.
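As an illustrative sketch of a grid search, assuming a GradientBoostingRegressor, a small hypothetical parameter grid, and synthetic data (the grids actually used in the project are not stated here):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for the movie dataset.
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# Hypothetical grid: 2 x 2 = 4 candidate configurations.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid=param_grid,
    scoring="r2",  # rank candidates by cross-validated R²
    cv=3,
)
search.fit(X, y)
best_params = search.best_params_

print(best_params, search.best_score_)
```

After fitting, search.best_estimator_ is refit on the full training data with the winning parameters and can be used directly for prediction.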

Command Line Interface (CLI)

We have developed a Command Line Interface (CLI) to allow users to input movie features and get revenue predictions. This tool provides an estimate of the inputted movie's revenue within specific ranges:

Low Revenue: <= $10M
Medium-Low Revenue: $10M - $40M
Medium Revenue: $40M - $70M
Medium-High Revenue: $70M - $120M
High Revenue: $120M - $200M
Ultra High Revenue: >= $200M
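The mapping from a predicted value to one of these ranges can be sketched as below. The function name revenue_range is hypothetical, and assigning the shared boundaries at $40M, $70M, and $120M to the lower range is an assumption, since the list above does not say which side they belong to.

```python
def revenue_range(predicted_gross: float) -> str:
    """Map a predicted gross in dollars to one of the six labelled ranges."""
    millions = predicted_gross / 1_000_000
    if millions <= 10:
        return "Low Revenue"
    if millions <= 40:
        return "Medium-Low Revenue"
    if millions <= 70:
        return "Medium Revenue"
    if millions <= 120:
        return "Medium-High Revenue"
    if millions < 200:
        return "High Revenue"
    return "Ultra High Revenue"  # >= $200M, as listed


print(revenue_range(55_000_000))
```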

Using the CLI

Navigate to the project directory.
Run the CLI:
```
python main.py
```
Follow the prompts to input the movie features and choose the prediction model.

Streamlit Web Interface

A web interface built with Streamlit also lets users input movie features and get revenue predictions.

Running the Web Interface

Navigate to the project directory.
Run the Web Interface:
```
streamlit run streamlit_app.py
```
Follow the prompts to input the movie features and choose the prediction model.

Model Evaluation Results

We evaluated our models using two key metrics: R² Score (Coefficient of Determination) and MSLE (Mean Squared Logarithmic Error). Here are the results for each model:

| Model | Training R² | Training MSLE | Testing R² | Testing MSLE |
|---|---|---|---|---|
| Linear Regression | 0.6181 | 0.0053 | 0.6520 | 0.0051 |
| Decision Tree | 0.8310 | 0.0024 | 0.5994 | 0.0059 |
| Bagging | 0.8380 | 0.0023 | 0.7105 | 0.0042 |
| Gradient Boosting | 0.8750 | 0.0016 | 0.7350 | 0.0040 |
| XGBoost | 0.8633 | 0.0018 | 0.7402 | 0.0041 |
| Random Forest | 0.8475 | 0.0022 | 0.7235 | 0.0041 |
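For reference, both metrics can be computed with scikit-learn as in the sketch below. The values are made up for illustration, and the scale on which the reported numbers were computed (raw or transformed targets) is not stated here.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error, r2_score

# Made-up true and predicted grosses in dollars.
y_true = np.array([12.0e6, 45.0e6, 80.0e6, 150.0e6])
y_pred = np.array([10.0e6, 50.0e6, 70.0e6, 160.0e6])

# R²: share of the target's variance explained by the predictions.
r2 = r2_score(y_true, y_pred)

# MSLE: mean squared difference of log(1 + value), so errors are
# judged in relative rather than absolute terms.
msle = mean_squared_log_error(y_true, y_pred)
manual_msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

print(f"R2={r2:.4f}  MSLE={msle:.4f}")
```

A large gap between training and testing R², as in the Decision Tree row, is the usual sign of overfitting.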

Conclusion

The Gradient Boosting and XGBoost models demonstrate promising accuracy and generalization, with testing R² scores of 0.7350 and 0.7402 respectively, and can support informed decision-making in the film industry to maximize profits.

Citation

If you found this work useful, please consider citing it as:

```
@misc{udandarao2024movie,
      title={Movie Revenue Prediction using Machine Learning Models},
      author={Vikranth Udandarao and Pratyush Gupta},
      year={2024},
      eprint={2405.11651},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```

Contact

Please feel free to open an issue or email us at [email protected] or [email protected].

License

This project is licensed under the MIT License.
