
YouTube Popular Videos Analysis 📊

This project tries to extract insights and patterns from YouTube's current most popular videos for a specific region (country; here, India). Over 20 important attributes of each video are analyzed using Pandas, NumPy, etc., and the insights are presented as visualizations built with Matplotlib and Seaborn.

This project starts with understanding the resources, methods, request parameters, and structure of the requested data for the YouTube Data API v3. Then, for robust analysis, a database is needed to store data collected at different timestamps over a long period (a month or more). Here, I explored the option of a cloud database, leveraging the benefits of Google Cloud Platform (using its free-tier/always-free products only!). I used its Compute Engine as a virtual machine to install and set up the database (NOTE: GCP's Cloud SQL is not included in the free tier).

This analysis may help anyone strategize their YouTube journey by understanding user preferences, current trends, areas for improvement, etc.

Outline

Some Visualizations

Included plots: Live-Stream vs Uploaded | Shorts vs Normal vs Long | Correlation Heatmap | Peak Hour for publishing Videos

Go back 🔝

Required Python libraries and modules

Install these with your Python package manager (e.g. pip) if they are not already installed, as they need to be imported: pip install package_name. A combined install command is sketched after the list.

  • datetime
  • dotenv
    pip install python-dotenv
  • io
  • matplotlib
  • numpy
  • os
  • pandas
  • pprint
  • psycopg2
  • random
  • requests
  • seaborn
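
Of the modules above, datetime, io, os, pprint, and random ship with the Python standard library, so only the rest need installing. For example, the third-party packages can be installed in one go (psycopg2 is commonly installed via the psycopg2-binary wheel):

    pip install python-dotenv matplotlib numpy pandas psycopg2-binary requests seaborn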

Go back 🔝

Setting up Compute Engine on Google Cloud Platform

You first need a Google billing account and a regular Google account (which can be the same). But don't worry: Google will not charge anything as long as you stay within the free usage limits. Also, turn off auto payments on your billing account so that even if you cross the free usage limits you will not be charged automatically (though your service will be terminated if you don't pay).

Then log in to the Google Cloud Console with that Google account and create a new project. Next, go to Compute Engine and create a new VM instance; you will need to attach the billing account here. While choosing the specifications, follow the free-tier usage limits to avoid any billing.

For any further guides or queries, please follow Google's Documentation.

Go back 🔝

Setting up PostgreSQL on Compute Engine

Please follow the detailed Google Cloud Community Tutorial contributed by Google employees to set up a PostgreSQL database in your virtual machine and configure it for remote connections.

Note: The guide uses the CIDR suffix /32, which denotes a single IPv4 address. That is fine for a static IP address, but you most likely have a dynamic IP address, in which case you need to determine the right CIDR suffix from the subnet mask of your network. For example, a subnet mask of 255.255.255.0 corresponds to /24.
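
If you are unsure of the suffix, here is a minimal sketch (standard library only) that derives it from a dotted subnet mask; replace the mask with your own network's value:

    import ipaddress

    mask = "255.255.255.0"  # replace with your network's subnet mask
    suffix = ipaddress.IPv4Network(f"0.0.0.0/{mask}").prefixlen
    print(f"/{suffix}")  # -> /24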

Go back 🔝

Creating Firewall rule for VM

Now, in order to connect remotely to the VM and then to the database, we need to create a firewall rule for our Compute Engine instance. The previous tutorial covers this, but here too, remember to replace the CIDR suffix with the one applicable to your network.

Go back 🔝

Enabling YouTube Data API v3

From the navigation menu of the Google Cloud Console, go to APIs & Services, then follow the path below:

Library ▶️ YouTube Data API v3 ▶️ Enable

Now, from the dashboard, you need to create credentials for the API, since an API key is required. Note that usage of this API is free, with a quota of 1,000 units per day, and each of our API calls costs 1 unit.

Go back 🔝

Understanding YouTube Data API v3

One really needs to understand the different API methods, their request parameters, and the description of the response fields in order to implement what is intended. For this, please refer to the Guides and Reference sections of the official documentation.

Here, we can play around with all the parameters (check the hidden parameters too!) without spending our daily quota. We can also see what the response will look like and verify whether the API call was correct or incomplete/wrong. The response data is generally in a nested JSON format.

After ensuring that all the parameters are set properly and the call returns a response with status code 200, choose SHOW CODE and grab the HTTPS URL that is automatically generated for your specified parameters. For more info about setting parameters, and slicing parts of some parameters, please follow this guide.

In this project, the region is set to IN, which is the ISO 3166-1 alpha-2 code for my country, India. You can change it to any other country code accepted by YouTube by simply changing the variable region_code in the 1st Notebook. Also, I have collected 100 videos (two calls of 50 videos each), although more videos, up to 200, may be available. The process to extract all of them using a loop is explained in the 1st Notebook.
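
As a rough sketch of what such a call looks like with the requests library (the endpoint and parameters follow the videos.list reference; the environment-variable name YT_API_KEY is only an assumption, not necessarily the one used in the notebooks):

    import os

    import requests

    API_KEY = os.getenv("YT_API_KEY")  # assumed key name, see the next section
    region_code = "IN"                 # same variable name as in the 1st Notebook

    url = "https://www.googleapis.com/youtube/v3/videos"
    params = {
        "part": "snippet,contentDetails,statistics",
        "chart": "mostPopular",
        "regionCode": region_code,
        "maxResults": 50,
        "key": API_KEY,
    }

    items = []
    while True:
        response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()          # expect status code 200
        payload = response.json()
        items.extend(payload["items"])       # nested JSON, one dict per video
        if "nextPageToken" not in payload:   # no further pages of popular videos
            break
        params["pageToken"] = payload["nextPageToken"]

    print(len(items))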

Go back 🔝

Storing all credentials as Environment Variables

We have to use a few credentials throughout the project: the API key for pulling data and the database credentials for connecting to the cloud database. These are secrets and should not be published publicly. It is also better not to hard-code these values, following one of the twelve-factor app principles.

The Python dotenv library provides a good solution: it reads key-value pairs from a .env file and can set them as environment variables. We can then use os.getenv() directly to retrieve them.

This repository contains a sample .env file that provides a template for the .env key-value pairs. Please insert the actual values and then remove the '_example' part from the file name.
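
A minimal sketch of how the values are read back at runtime (the key names below are placeholders, not necessarily those used in the sample file):

    import os

    from dotenv import load_dotenv

    load_dotenv()  # reads key-value pairs from .env into environment variables

    api_key = os.getenv("YT_API_KEY")       # placeholder key name
    db_password = os.getenv("DB_PASSWORD")  # placeholder key name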

For more info and advanced configuration, please visit here.

Go back 🔝

Data Transformation for efficient memory usage

After normalizing the JSON data using pandas.json_normalize(), we drop redundant columns once the useful information has been extracted from them. Then, by applying appropriate datatypes, we can achieve around 50% reduced memory usage per column (measured per column, since additional data wrangling changes the number of columns). This is because pandas often stores data as objects, and our data arrives mostly as strings even for numeric columns. Declaring proper categorical columns also helps a lot!
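
A small sketch of the idea, using a tiny stand-in for the nested API response and a couple of illustrative columns:

    import pandas as pd

    # Tiny stand-in for the nested JSON returned by the API.
    items = [
        {"id": "abc123",
         "snippet": {"categoryId": "10"},
         "statistics": {"viewCount": "123456"}},
    ]

    df = pd.json_normalize(items)

    # Numeric fields arrive as strings; convert and downcast them.
    df["statistics.viewCount"] = pd.to_numeric(
        df["statistics.viewCount"], downcast="unsigned"
    )

    # Low-cardinality string columns compress well as categoricals.
    df["snippet.categoryId"] = df["snippet.categoryId"].astype("category")

    df.info(memory_usage="deep")  # compare per-column memory before and after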

This optimization might not matter much at our current scale, but it is implemented with scalability in mind. Also, we are using the fast copy_from(StringIO, ...) method to load data into the database, which uses memory proportional to the memory usage of the DataFrame. So it is better to perform the data transformation first; it also makes it possible to run a quick analysis on the small, recently collected batch if required.

The same is done extensively in the 2nd Notebook, as we will be analyzing a much bigger dataset (currently 5,000 rows) collected from our database.

Go back 🔝

Method used to efficiently load data into Database

There are many methods to load a DataFrame into a database, but they do not all perform the same way. One should not loop and execute one insert query at a time unless absolutely necessary, as this is highly inefficient/slow. For bulk inserts, multiple options are available, and as the number of rows in our DataFrame increases, their performance varies greatly.

Performance Graph

Here are two great articles ( Article-1, Article-2 ) that provide a detailed comparison of these methods along with a code template for each. I used one of the fastest methods here. Though fastest, this method is not very memory efficient; our data transformation helps in this regard, and there is also an excellent work-around mentioned in Article-2.
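
A minimal sketch of the copy_from(StringIO, ...) approach with psycopg2; the table name, columns, and environment-variable names below are placeholders, not the project's actual schema:

    import io
    import os

    import pandas as pd
    import psycopg2

    df = pd.DataFrame({"video_id": ["abc123"], "view_count": [123456]})  # stand-in

    conn = psycopg2.connect(
        host=os.getenv("DB_HOST"),
        dbname=os.getenv("DB_NAME"),
        user=os.getenv("DB_USER"),
        password=os.getenv("DB_PASSWORD"),
    )

    # Serialize the DataFrame into an in-memory, tab-separated buffer.
    buffer = io.StringIO()
    df.to_csv(buffer, sep="\t", index=False, header=False)
    buffer.seek(0)

    # COPY the whole buffer into the table in a single round trip.
    with conn.cursor() as cur:
        cur.copy_from(buffer, "popular_videos", sep="\t",
                      columns=("video_id", "view_count"))
    conn.commit()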

Go back 🔝

Interacting with Database

The psycopg2 adapter provides connection ( doc ) and cursor ( doc ) classes to execute SQL commands and queries from Python code within a database session.

We also need to follow the PostgreSQL documentation to understand its extensions to standard SQL, such as declaring enum datatypes, inserting arrays as values, acceptable formats for time-zone aware timestamps, timedelta/interval formats, etc.
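
For instance, an enum type, an array column, and a time-zone aware timestamp column might be declared through psycopg2 roughly like this (hypothetical names, not the project's actual schema):

    import os

    import psycopg2

    conn = psycopg2.connect(
        host=os.getenv("DB_HOST"), dbname=os.getenv("DB_NAME"),
        user=os.getenv("DB_USER"), password=os.getenv("DB_PASSWORD"),
    )

    with conn.cursor() as cur:
        # PostgreSQL-specific pieces: ENUM type, text[] array, timestamptz.
        cur.execute("CREATE TYPE video_kind AS ENUM ('Live', 'Uploaded')")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS demo_videos (
                video_id        text,
                entry_timestamp timestamptz,
                topics          text[],
                video_type      video_kind,
                PRIMARY KEY (video_id, entry_timestamp)
            )
        """)
    conn.commit()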

Go back 🔝

Closing Database Connections

After executing SQL queries, we need to commit for the changes to take effect in the database. Also, if an error occurs while executing a query, the transaction is aborted and all further commands are ignored until you issue a rollback.

It is important to close cursors after completing interactions with the database, for safety reasons. Finally, the connection should also be closed.
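
Putting it together, a typical session looks roughly like this (again with the placeholder connection parameters and table name from the earlier sketches):

    import os

    import psycopg2

    conn = psycopg2.connect(
        host=os.getenv("DB_HOST"), dbname=os.getenv("DB_NAME"),
        user=os.getenv("DB_USER"), password=os.getenv("DB_PASSWORD"),
    )
    cur = conn.cursor()
    try:
        cur.execute("SELECT count(*) FROM demo_videos")  # placeholder table
        print(cur.fetchone())
        conn.commit()       # make any changes permanent
    except psycopg2.Error:
        conn.rollback()     # clear the aborted transaction before reusing conn
        raise
    finally:
        cur.close()         # close the cursor first...
        conn.close()        # ...then the connection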

Go back 🔝

Transformed/Generated Columns

Here are descriptions of a few columns transformed from the raw data and generated at different stages of the data pipeline.

1. Rank

The index of a video in the response of each API call, starting from 1. As I have collected 100 videos each time, the Rank is in the range 1-100.

Go back 🔝

2. (Title/Audio)_Language_Name

Obtained by matching language codes against the API response of YouTube's i18nLanguages. In some cases, the code provided by the owner is valid but not listed in that API response. Since we are specifically analyzing YouTube-related data, such codes have not been decoded outside this defined scope.

For all such cases (e.g. Bihari dialects, or an explicitly specified zxx), the code has been changed to zxx, which according to ISO 639 stands for Not Applicable.

Go back 🔝

3. Topics

Converted from: Topics_Links ➡️ Topics

The API response contains the links of the Wikipedia pages for specific topics. The topic name has been extracted from Topics_Links and joined into a comma-separated string for easier insertion into the database.
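
Roughly, the extraction looks like this (the link values are illustrative):

    # Pull the page title out of each Wikipedia URL and join into one string.
    links = [
        "https://en.wikipedia.org/wiki/Music",
        "https://en.wikipedia.org/wiki/Pop_music",
    ]

    topics = ", ".join(link.rsplit("/", 1)[-1].replace("_", " ") for link in links)
    print(topics)  # -> "Music, Pop music"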

Go back 🔝

4. Entry_Timestamp

This is the timestamp of data collection, in UTC. pandas.Timestamp.utcnow() is executed in the same code cell where the data is fetched from the API. This is necessary because it becomes part of the primary key in our database table, along with the video_id column, to uniquely identify a popular-video entry.

Go back 🔝

5. video_type

Inferred from: live_start_real, live_start_scheduled ➡️ video_type

This is a category assigned based on whether a video is (or is going to be) live-streamed content or a normal uploaded/posted video.
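
One way to infer such a category with pandas/NumPy is sketched below; the labels and the exact rule are an illustrative assumption, not necessarily the notebook's logic:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "live_start_real":      [pd.Timestamp("2022-01-01 10:00"), pd.NaT, pd.NaT],
        "live_start_scheduled": [pd.Timestamp("2022-01-01 10:00"),
                                 pd.Timestamp("2022-01-02 18:00"), pd.NaT],
    })

    conditions = [df["live_start_real"].notna(), df["live_start_scheduled"].notna()]
    labels = ["Live", "Scheduled Live"]  # assumed category labels
    df["video_type"] = np.select(conditions, labels, default="Uploaded")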

Go back 🔝

6. duration_tag

Inferred from: duration ➡️ duration_sec ➡️ duration_tag

As we can see from the histogram of the duration_sec distribution, it is extremely positively skewed. So, to better understand trends, this category is generated. It assigns each video to one of 3 categories: Shorts, Normal, or Long. YouTube Shorts is well defined, but the other two are not; their limiting durations are based entirely on our own sense of what the average is.

This enables us to analyze the newly added YouTube feature, #SHORTS.
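
A rough sketch of the two-step conversion; the ISO 8601 parsing covers typical YouTube durations, and the 60-second and 30-minute cut-offs are assumed values, not necessarily the notebook's thresholds:

    import re

    import pandas as pd

    def iso8601_to_seconds(duration: str) -> int:
        """Parse YouTube durations like 'PT1H2M10S' into total seconds."""
        match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", duration)
        if match is None:  # e.g. 'P0D' for an upcoming live stream
            return 0
        hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
        return hours * 3600 + minutes * 60 + seconds

    df = pd.DataFrame({"duration": ["PT45S", "PT12M30S", "PT1H40M"]})
    df["duration_sec"] = df["duration"].map(iso8601_to_seconds)

    df["duration_tag"] = pd.cut(
        df["duration_sec"],
        bins=[0, 60, 1800, float("inf")],   # assumed category boundaries
        labels=["Shorts", "Normal", "Long"],
        include_lowest=True,
    )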

Go back 🔝

7. local_publish_time

Converted from: maximum ( published_at, live_start_scheduled, live_start_real ) ➡️ local_publish_time

It is the UTC timestamp converted to the local timestamp. This helps us understand the peak hour for publishing new videos on YouTube that go on to become popular. It indicates the characteristics of content creators, such as their preferred time to upload new content.
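
A minimal sketch of the conversion with pandas; Asia/Kolkata matches the IN region used here, and the column values are illustrative:

    import pandas as pd

    df = pd.DataFrame({
        "published_at":         pd.to_datetime(["2022-01-01 10:00"], utc=True),
        "live_start_scheduled": pd.to_datetime(["2022-01-01 12:30"], utc=True),
        "live_start_real":      pd.to_datetime([pd.NaT], utc=True),
    })

    # Take the latest of the three UTC timestamps, then shift to local time.
    utc_publish = df[["published_at", "live_start_scheduled", "live_start_real"]].max(axis=1)
    df["local_publish_time"] = utc_publish.dt.tz_convert("Asia/Kolkata")
    df["publish_hour"] = df["local_publish_time"].dt.hour  # for the peak-hour analysis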

Go back 🔝

Inference, Hypothesis, Validation from Analysis

In Notebook 2, under the section Performing Data Analysis & Visualization, all the inferences are provided as markdown cells. Similarly, some hypotheses are proposed and validated; all related details are provided right after the corresponding analysis in the notebook itself. Please check those out by following the above link or simply navigating to Notebook 2.

A few of the propositions include:

  • no. of likes in a video is ~ 5% of its views
  • no. of dislikes is ~ 5% of its likes
  • etc.

Go back 🔝

Abbreviations Used

The following abbreviations are used in this project.

Short Form   Meaning
doc          Documentation
enum         Enumerated (Categorical)
GCP          Google Cloud Platform
Shorts       YouTube Shorts
Stats        Statistics
VM           Virtual Machine
UTC          Coordinated Universal Time
zxx          No linguistic content / Not applicable

Go back 🔝

Acknowledgement

...and there are many more 🙂

Go back 🔝
