This project extracts insights and patterns from YouTube's current most popular videos for a specific region (country; here, India). Over 20 important attributes of each video are analyzed using Pandas, NumPy, etc., and the insights are presented as visualizations built with Matplotlib and Seaborn.
The project starts with understanding the resources, methods, request parameters, structure of the requested data, etc. of YouTube Data API v3. Then, for robust analysis, a database is needed to store the data collected at different timestamps over a long period (a month or more). Here, I have explored the option of a cloud database, leveraging the benefits of Google Cloud Platform (using its free-tier/always-free products only!). I used a Compute Engine virtual machine to install and set up the database (NOTE: Cloud SQL on GCP is not included in the free tier).
This analysis may help anyone strategize their YouTube journey by understanding user preferences, current trends, areas for improvement, etc.
- Some Visualizations
- Required Python libraries and modules
- Setting up Compute Engine on Google Cloud Platform
- Setting up PostgreSQL on Compute Engine
- Creating Firewall rule for VM
- Enabling YouTube Data API v3
- Understanding YouTube Data API v3
- Storing all credentials as Environment Variables
- Data Transformation for efficient memory usage
- Method used to efficiently load data into Database
- Interacting with Database
- Closing Database Connections
- Transformed/Generated Columns
- Inference, Hypothesis, Validation from Analysis
- Abbreviations Used
- Acknowledgement
Install these with your Python package manager (e.g. pip install package_name) if they are not already available, as they need to be imported (a minimal import sketch follows the list):
- datetime
- dotenv
pip install python-dotenv
- io
- matplotlib
- numpy
- os
- pandas
- pprint
- psycopg2
- random
- requests
- seaborn
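For reference, here is a minimal import sketch corresponding to the list above (the notebooks may import additional submodules):

```python
# Standard-library modules
import os
import io
import random
from datetime import datetime, timedelta
from pprint import pprint

# Third-party packages (install with pip if missing)
import requests
import numpy as np
import pandas as pd
import psycopg2
import matplotlib.pyplot as plt
import seaborn as sns
from dotenv import load_dotenv  # provided by the python-dotenv package
```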
You first need a Google billing account and a regular Google account (they can be the same). Don't worry: Google will not charge anything as long as you stay within the free limits. Also, turn off auto payments on your billing account so that even if you cross the free usage limits you will not be charged (the service will simply be stopped if you don't pay).
Then log in to the Google Cloud Console with that Google account and create a new project. Go to Compute Engine and create a new VM instance; you will need to attach the billing account here. While choosing the specifications, follow the free-tier usage limits to avoid any billing.
For any further guides or queries, please follow Google's Documentation.
Please follow the detailed Google Cloud Community Tutorial contributed by Google employees to set up a PostgreSQL database in your virtual machine and configure it for remote connections.
Note: the guide uses the CIDR suffix /32, which means a single IPv4 address. That is fine for static IP addresses, but you most likely have a dynamic IP address, in which case you need to identify the right CIDR suffix from the subnet mask of your network. For example, for a 255.255.255.0 mask it should be /24.
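If you are unsure which suffix corresponds to your subnet mask, Python's standard ipaddress module can derive it; a minimal sketch (example mask, substitute your own):

```python
import ipaddress

# e.g. a 255.255.255.0 subnet mask corresponds to the CIDR suffix /24
mask = "255.255.255.0"
prefix = ipaddress.IPv4Network(f"0.0.0.0/{mask}").prefixlen
print(f"/{prefix}")  # -> /24
```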
Now, in order to connect remotely to the VM and then to the database, we need to create a firewall rule for our Compute Engine instance. The tutorial above covers this, but here too, remember to replace the CIDR suffix with the right one for your network.
From the navigation menu of the Google Cloud console, go to APIs & Services, then follow the path below:
Library
Now, from the dashboard, you need to create Credentials for the API, as an API key is required. Note that usage of this API is free within a quota of 1000 units per day, and each of our API calls costs 1 unit.
One really needs to understand the different API methods, their request parameters, and the description of the response fields in order to implement what is intended. For this, please refer to the Guides and Reference sections of the official documentation.
Here, we can play around with all the parameters (check the hidden parameters too!) without spending our daily quota. We can also see what the response will look like and verify whether the API call was correct or incomplete/wrong. The response data is generally in a nested JSON format.
After ensuring that all the parameters are set properly and the call returns a response with status code 200, choose SHOW CODE and grab the HTTPS URL automatically generated from your specified parameters. For more info about setting parameters, and slicing parts of some parameters, please follow this guide.
In this project, the region is set to IN, the ISO 3166-1 alpha-2 code for my country, India. You can change it to any other country code accepted by YouTube simply by changing the variable region_code in the 1st Notebook. Also, I have collected 100 videos (50 videos per call, over 2 calls), although more videos may be available, up to 200. The process to extract all of them using a loop is explained in the 1st Notebook.
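As a rough illustration (not the exact notebook code), the paginated call could look like the sketch below; the API key variable and environment-variable name are placeholders:

```python
import os
import requests

API_KEY = os.getenv("YT_API_KEY")  # hypothetical env-variable name
URL = "https://www.googleapis.com/youtube/v3/videos"

region_code = "IN"  # any ISO 3166-1 alpha-2 code accepted by YouTube
params = {
    "part": "snippet,contentDetails,statistics",
    "chart": "mostPopular",
    "regionCode": region_code,
    "maxResults": 50,
    "key": API_KEY,
}

items = []
for _ in range(2):                    # 2 calls x 50 videos = 100 videos (1 quota unit each)
    response = requests.get(URL, params=params, timeout=30)
    response.raise_for_status()       # expect status code 200
    data = response.json()
    items.extend(data["items"])
    page_token = data.get("nextPageToken")
    if not page_token:                # no more pages available
        break
    params["pageToken"] = page_token  # fetch the next page of results
```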
We have to use a few credentials for the entire project: the API key for pulling data and the database credentials for connecting to the cloud database. These are secrets and should not be published publicly. It is also better not to hard-code these values, following one of the 12-factor principles.
The python-dotenv
library provides a good solution. It reads key-value pairs from a .env
file and can set them as environment variables. Then we can directly use os.getenv()
to get the environment variables.
This repository contains a sample .env file which provides a template for the .env key-value pairs. Please insert your actual values and then remove the '_example' part from the file name.
For more info and advanced configuration, please visit here.
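A minimal sketch of this pattern, assuming hypothetical key names such as YT_API_KEY (use whatever names your .env file defines):

```python
import os
from dotenv import load_dotenv

load_dotenv()                      # reads key-value pairs from the .env file

API_KEY = os.getenv("YT_API_KEY")  # example key names; match your .env file
DB_HOST = os.getenv("DB_HOST")
DB_PASSWORD = os.getenv("DB_PASSWORD")
```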
After normalizing the JSON data using pandas.json_normalize(), we drop redundant columns once the useful information has been extracted from them. Then, by applying appropriate datatypes, we can achieve roughly 50% lower memory usage per column (measured per column, since the additional data wrangling changes the number of columns). This is because pandas often stores data as objects, and our data arrives mostly as strings, even for numeric columns. Declaring proper categorical columns also helps a lot!
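A simplified sketch of the idea, continuing from the items list collected in the API sketch above (the exact columns kept and dtypes applied in the notebooks differ):

```python
import pandas as pd

# Flatten the nested JSON into a tabular DataFrame
df = pd.json_normalize(items)
print(df.memory_usage(deep=True).sum())   # mostly object/string columns

# Cast numeric strings, timestamps, and low-cardinality strings to leaner dtypes
df["statistics.viewCount"] = pd.to_numeric(df["statistics.viewCount"])
df["snippet.publishedAt"] = pd.to_datetime(df["snippet.publishedAt"])
df["snippet.categoryId"] = df["snippet.categoryId"].astype("category")

print(df.memory_usage(deep=True).sum())   # noticeably smaller
```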
This might not matter much at our current scale, but it is implemented with scalability in mind. We are also using the fastest method, copy_from(StringIO, ...), to load data into the database, but its memory usage is proportional to that of the DataFrame. So it is better to perform the data transformation first, which also makes it possible to run a quick analysis on the small, recently collected batch of data if required.
The same is done extensively in the 2nd Notebook, as we will be analyzing a much bigger dataset (currently 5000 rows) collected from our database.
There are many methods to load a DataFrame into a database, but they do not all perform the same way. One should not loop and execute one insert query at a time unless absolutely necessary, as this is highly inefficient/slow. For bulk inserts, multiple options are available, and as the number of rows in our DataFrame increases, their performance varies greatly.
Here are two great articles (Article-1, Article-2) that provide a detailed comparison of these methods along with a code template for each. I used one of the fastest methods here. Though fastest, this method is not very memory efficient; our data transformation helps in this regard, and there is also an excellent workaround mentioned in Article-2.
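A condensed sketch of that approach, adapted from the templates in those articles (not the exact notebook code):

```python
import io

def copy_from_stringio(conn, df, table):
    """Bulk-load a DataFrame into a PostgreSQL table via an in-memory buffer."""
    buffer = io.StringIO()
    # Tab-separated, no header/index; copy_from does not handle CSV quoting,
    # so values containing the separator or newlines need extra care.
    df.to_csv(buffer, sep="\t", header=False, index=False)
    buffer.seek(0)
    with conn.cursor() as cur:
        cur.copy_from(buffer, table, sep="\t", columns=list(df.columns))
    conn.commit()
```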
The psycopg2 wrapper provides connection (doc) and cursor (doc) classes to execute SQL commands and queries from Python code within a database session.
We also need to follow the PostgreSQL documentation to understand its extensions to standard SQL, such as declaring enum datatypes, inserting arrays as values, acceptable formats for timezone-aware timestamps, timedelta/interval formats, etc.
After executing SQL queries, we need to commit for the changes to take effect in the database. Also, if an error occurs while executing a query, the transaction is aborted and all further commands are ignored until you issue a rollback.
It is important to close cursors after completing interactions with the database, for safety reasons. Finally, the connection should also be closed.
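Put together, the connect / execute / commit / rollback / close cycle looks roughly like this (credential names and the table name are placeholders):

```python
import os
import psycopg2

conn = psycopg2.connect(
    host=os.getenv("DB_HOST"),      # the VM's external IP
    dbname=os.getenv("DB_NAME"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)
cur = conn.cursor()

try:
    cur.execute("SELECT COUNT(*) FROM trending_videos;")  # hypothetical table name
    print(cur.fetchone())
    conn.commit()        # make any changes permanent
except psycopg2.Error:
    conn.rollback()      # abort the failed transaction so new commands can run
    raise
finally:
    cur.close()          # close the cursor first ...
    conn.close()         # ... then the connection
```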
Here are descriptions of a few columns transformed from the raw data and generated at different stages of the data pipeline.
The index of a video in the response of each API call, starting from 1. As I have collected 100 videos each time, the Rank is in the range 1-100.
Obtained by matching language codes against the API response of YouTube's i18nLanguages. In some cases, the code provided by the video owner is not listed in that response even though the code is valid; since we are specifically analyzing YouTube-related data, such codes have not been decoded outside this defined scope. For all such cases (e.g. Bihari dialects, or an explicitly provided zxx), the code has been changed to zxx, which according to ISO 639 stands for "not applicable / no linguistic content".
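A sketch of one way to implement this fallback (the request shape follows the i18nLanguages endpoint; variable names are placeholders):

```python
import os
import requests

API_KEY = os.getenv("YT_API_KEY")  # hypothetical env-variable name

# Language codes that YouTube itself lists
lang_items = requests.get(
    "https://www.googleapis.com/youtube/v3/i18nLanguages",
    params={"part": "snippet", "key": API_KEY},
    timeout=30,
).json()["items"]
known_codes = {item["snippet"]["hl"] for item in lang_items}

def normalize_language(code):
    """Keep codes YouTube lists; map everything else to 'zxx'."""
    return code if code in known_codes else "zxx"
```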
Converted from: Topics_Links → Topics
The API response contains the links of the Wikipedia pages for specific topics. The topic name is extracted from each of the Topics_Links and joined into a comma-separated string for easier insertion into the database.
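For illustration, the extraction could be as simple as the sketch below (the notebook's exact handling of edge cases may differ):

```python
def topics_from_links(topic_links):
    """'https://en.wikipedia.org/wiki/Film' -> 'Film'; join all names with commas."""
    names = [link.rstrip("/").rsplit("/", 1)[-1].replace("_", " ")
             for link in (topic_links or [])]
    return ", ".join(names)
```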
This is the timestamp of data collection, in UTC. pandas.Timestamp.utcnow() is executed in the same code cell where the data is fetched from the API. This is necessary because it becomes part of the Primary Key in our database table, along with another column, video_id, to uniquely identify a popular video.
Inferred from: live_start_real, live_start_scheduled → video_type
This category is assigned based on whether a video is (or is going to be) live-streamed content or a normal uploaded/posted video.
Inferred from: duration → duration_sec → duration_tag
As we can see from the histogram of the duration_sec distribution, it is extremely positively skewed. So, to understand the trends better, this category is generated: it assigns each video to one of 3 categories, namely Shorts, Normal, and Long. Though YouTube Shorts is well defined, the other 2 are not; their limiting durations are based entirely on our own sense of what the average is.
This enables us to analyze the newly added YouTube feature, #SHORTS.
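A minimal sketch of this binning, with assumed thresholds (60 seconds for Shorts, 20 minutes for Long; the notebook's limits may differ):

```python
import numpy as np
import pandas as pd

bins = [0, 60, 20 * 60, np.inf]          # assumed cut-offs in seconds
labels = ["Shorts", "Normal", "Long"]
df["duration_tag"] = pd.cut(df["duration_sec"], bins=bins, labels=labels)
```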
Converted from: maximum( published_at, live_start_scheduled, live_start_real ) → local_publish_time
It is the UTC timestamp converted to the local timestamp. This helps us understand the peak hours for publishing new videos that become popular on YouTube, and indicates characteristics of the content creators, such as their preferred time to upload new content.
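A sketch of the conversion, assuming the three source columns are already timezone-aware UTC timestamps and using Asia/Kolkata as the local timezone for IN:

```python
# Latest of the three publish-related timestamps, converted from UTC to local time
publish_utc = df[["published_at", "live_start_scheduled", "live_start_real"]].max(axis=1)
df["local_publish_time"] = publish_utc.dt.tz_convert("Asia/Kolkata")
```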
In Notebook 2, under the section Performing Data Analysis & Visualization, all the inferences are provided as markdown cells. Similarly, some hypotheses are proposed and validated there. All related details are provided right after each analysis in the Notebook itself. Please check those out by following the above link or simply navigating to Notebook 2.
A few propositions include (a quick ratio-check sketch follows this list):
- the no. of likes on a video is ~ 5% of its views
- the no. of dislikes is ~ 5% of its likes
- etc.
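As a rough illustration, such propositions can be checked with one-liners like the following (column names are assumptions; the actual validation lives in Notebook 2):

```python
# Median ratios across all collected videos
print((df["like_count"] / df["view_count"]).median())     # expected ~ 0.05
print((df["dislike_count"] / df["like_count"]).median())  # expected ~ 0.05
```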
Following are the abbreviations used in this Project.
Short Form | Meaning |
---|---|
doc | Documentation |
enum | Enumerate (Categorical) |
GCP | Google Cloud Platform |
Shorts | YouTube Shorts |
Stats | Statistics |
VM | Virtual Machine |
UTC | Coordinated Universal Time |
zxx | No linguistic content, Not applicable |
- 📺 The One and Only Data Science Project You Need by Nate at StrataScratch
- 📺 An Introduction to GCP for Students by Google Cloud Tech
- 📺 The Google Cloud Platform Free Trial and Free Tier by Google Cloud Tech
- 📺 Deploying Free Tier (Always Free) VM in Google Cloud Platform - Snapshots, VPC Firewall and more by Brian V
- 📝 Pandas to PostgreSQL using Psycopg2: Bulk Insert Performance Benchmark by Naysan Saran
- 📝 Fastest Way to Load Data Into PostgreSQL Using Python, From two minutes to less than half a second! by Haki Benita

...and there are many more!