Rohan Prabhune, Manan Shah
- Introduction
- Loading the necessary packages
- Reading the data
- Pre-processing
- Summarizations
- Modeling
- Model evaluation on test set
The data set describes features (token details, keywords, day of the week, etc.) for each record within a particular channel. The purpose of the analysis is to predict the performance of each individual channel, measured by the number of shares, which in the modelling context is the response variable. The url of each record and timedelta (the number of days between article publication and dataset acquisition) are removed, as they are not useful in predicting shares. A few variables, namely n_non_stop_words, kw_min_max, self_reference_max_shares, rate_positive_words, n_unique_tokens, global_rate_negative_words, kw_min_min, kw_avg_min, and self_reference_avg_sharess, which are highly correlated in all six channels, are removed before applying any modelling technique. Finally, four modelling methodologies are fitted on the training dataset: forward subset linear regression, backward subset linear regression, and two ensemble techniques, random forest and boosting. To evaluate each model fit, predictions are made on the test dataset and the RMSE value is calculated. These RMSE values are compared across all four models, and the one with the lowest RMSE is declared the winner for a particular channel.
library(tidyverse)   # data wrangling and plotting
library(caret)       # model training and evaluation
library(kableExtra)  # formatted tables
library(corrplot)    # correlation plots
library(timereg)     # qcut() for quantile binning
Citation for the dataset:
K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision
Support System for Predicting the Popularity of Online News. Proceedings
of the 17th EPIA 2015 - Portuguese Conference on Artificial
Intelligence, September, Coimbra, Portugal.
# Reading the data
dataset <- read_csv("data/OnlineNewsPopularity.csv",show_col_types=FALSE)
dataset
Here, remaining_cols is used to store the column names other than the data_channel_is_* columns. The column corresponding to the selected type of channel is then appended to it, giving subset_cols. The dataset is filtered to the columns in subset_cols and to the rows where the selected data_channel_is_* column equals 1.
# Vector of columns other than data_channel_is*
remaining_cols <- names(dataset)[c(1:13,20:61)]
# Subset based on type of article
input_data_channel = params$var
select_var = paste0("data_channel_is_",input_data_channel)
subset_cols = append(remaining_cols,select_var)
# Subset data
df <- dataset %>% select(all_of(subset_cols)) %>%
  filter(.data[[select_var]] == 1) %>%
  select(all_of(remaining_cols))
We remove url and timedelta as they are non-predictive variables, and filter the tibble on n_tokens_content and n_tokens_title, that is, we keep only rows where the number of words in the content and in the title are both nonzero.
# Remove url and timedelta and filter on data
df <- df %>% select(-url,-timedelta) %>% filter(n_tokens_content!=0,n_tokens_title!=0)
df
# Column names
names(df)
## [1] "n_tokens_title" "n_tokens_content"
## [3] "n_unique_tokens" "n_non_stop_words"
## [5] "n_non_stop_unique_tokens" "num_hrefs"
## [7] "num_self_hrefs" "num_imgs"
## [9] "num_videos" "average_token_length"
## [11] "num_keywords" "kw_min_min"
## [13] "kw_max_min" "kw_avg_min"
## [15] "kw_min_max" "kw_max_max"
## [17] "kw_avg_max" "kw_min_avg"
## [19] "kw_max_avg" "kw_avg_avg"
## [21] "self_reference_min_shares" "self_reference_max_shares"
## [23] "self_reference_avg_sharess" "weekday_is_monday"
## [25] "weekday_is_tuesday" "weekday_is_wednesday"
## [27] "weekday_is_thursday" "weekday_is_friday"
## [29] "weekday_is_saturday" "weekday_is_sunday"
## [31] "is_weekend" "LDA_00"
## [33] "LDA_01" "LDA_02"
## [35] "LDA_03" "LDA_04"
## [37] "global_subjectivity" "global_sentiment_polarity"
## [39] "global_rate_positive_words" "global_rate_negative_words"
## [41] "rate_positive_words" "rate_negative_words"
## [43] "avg_positive_polarity" "min_positive_polarity"
## [45] "max_positive_polarity" "avg_negative_polarity"
## [47] "min_negative_polarity" "max_negative_polarity"
## [49] "title_subjectivity" "title_sentiment_polarity"
## [51] "abs_title_subjectivity" "abs_title_sentiment_polarity"
## [53] "shares"
The entire data set is divided into training and testing data sets. This is done in order to perform exploratory data analysis and modelling on the training data, and later evaluate performance on the unseen test data.
set.seed(52)
# Get the row indexes for the training data
train_index <- sample(nrow(df), nrow(df)*0.7)
# Get training data
train_df <- df[train_index,]
# Get test data
test_df <- df[-train_index,]
Here, a column-wise summary of important variables such as the number of images, number of videos, number of unique tokens, etc. is generated for the training data. This helps us evaluate the distribution of these variables.
# Creating summaries for some numerical variables
df_summary <- train_df %>%
  select(num_imgs,num_videos,num_hrefs,n_unique_tokens,num_keywords,
         global_rate_positive_words,global_rate_negative_words)
predictor_table <- apply(df_summary, MARGIN = 2,FUN = summary, na.rm = TRUE)
predictor_table %>%
kbl(caption="Summary table for predictor variables") %>%
kable_classic(full_width = F)
| | num_imgs | num_videos | num_hrefs | n_unique_tokens | num_keywords | global_rate_positive_words | global_rate_negative_words |
|---|---|---|---|---|---|---|---|
| Min. | 0.000000 | 0.000000 | 0.000000 | 0.2609953 | 2.000000 | 0.0000000 | 0.0000000 |
| 1st Qu. | 1.000000 | 0.000000 | 4.000000 | 0.4785764 | 5.000000 | 0.0327766 | 0.0090909 |
| Median | 1.000000 | 0.000000 | 7.000000 | 0.5471194 | 6.000000 | 0.0428571 | 0.0139860 |
| Mean | 1.826535 | 0.673923 | 9.430339 | 0.5475176 | 6.484418 | 0.0436086 | 0.0147766 |
| 3rd Qu. | 1.000000 | 0.000000 | 12.000000 | 0.6115312 | 8.000000 | 0.0538922 | 0.0193705 |
| Max. | 51.000000 | 75.000000 | 97.000000 | 0.8732394 | 10.000000 | 0.1250000 | 0.0588235 |
This table provides the numerical summary of the response variable. From it we can read off the minimum, maximum, median, mean, and quartile values of shares.
response_table <- as.array(summary(train_df$shares))
response_table %>%
kbl(caption="Summary table for response variable") %>%
kable_classic(full_width = F)
| Statistic | Shares |
|---|---|
| Min. | 1.000 |
| 1st Qu. | 954.000 |
| Median | 1400.000 |
| Mean | 3245.892 |
| 3rd Qu. | 2500.000 |
| Max. | 690400.000 |
The training data already has dummy variables generated. The weekday_is_* variables are grouped into a single day variable, and the number of images is converted from a quantitative variable into a categorical one. Two-way contingency tables are then created: day vs. the grouped image counts, and day vs. number of keywords.
# Create a categorical "day" column
train_df$day <- ifelse(train_df$weekday_is_monday == 1, "Mon",
ifelse(train_df$weekday_is_tuesday == 1, "Tues",
ifelse(train_df$weekday_is_wednesday == 1, "Wed",
ifelse(train_df$weekday_is_thursday == 1, "Thurs",
ifelse(train_df$weekday_is_friday == 1, "Fri",
ifelse(train_df$weekday_is_saturday == 1, "Sat",
ifelse(train_df$weekday_is_sunday == 1, "Sun","NA")))))))
train_df$day <- ordered(as.factor(train_df$day),
levels = c("Mon","Tues","Wed","Thurs","Fri","Sat","Sun"))
# Create a categorical "image_grouped" column
train_df$image_grouped <- ifelse(train_df$num_imgs %in% c(0:25), "0-25",
                          ifelse(train_df$num_imgs %in% c(26:50), "26-50",
                          ifelse(train_df$num_imgs %in% c(51:75), "51-75",
                          ifelse(train_df$num_imgs %in% c(76:100), "76-100","100+"))))
train_df$image_grouped <- ordered(as.factor(train_df$image_grouped),
levels = c("0-25","26-50","51-75","76-100","100+"))
# Contingency table between day and image_grouped
contingency_1 <- table(train_df$day,train_df$image_grouped)
contingency_1 %>%
kbl(caption="Table for Days and Number of Images") %>%
kable_classic(full_width = F)
| | 0-25 | 26-50 | 51-75 | 76-100 | 100+ |
|---|---|---|---|---|---|
| Mon | 796 | 5 | 0 | 0 | 0 |
| Tues | 816 | 4 | 0 | 0 | 0 |
| Wed | 871 | 3 | 0 | 0 | 0 |
| Thurs | 875 | 2 | 0 | 0 | 0 |
| Fri | 582 | 2 | 0 | 0 | 0 |
| Sat | 169 | 1 | 0 | 0 | 0 |
| Sun | 237 | 0 | 1 | 0 | 0 |
# Contingency table between day and num_keywords
contingency_2 <- table(as.factor(train_df$day), as.factor(train_df$num_keywords))
contingency_2 %>%
kbl(caption="Table for Days and Number of Keywords") %>%
kable_classic(full_width = F)
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| Mon | 2 | 22 | 91 | 161 | 180 | 134 | 80 | 57 | 74 |
| Tues | 4 | 22 | 107 | 155 | 168 | 113 | 98 | 65 | 88 |
| Wed | 3 | 36 | 111 | 177 | 181 | 122 | 92 | 62 | 90 |
| Thurs | 2 | 48 | 112 | 172 | 160 | 123 | 94 | 56 | 110 |
| Fri | 1 | 18 | 68 | 117 | 102 | 83 | 74 | 54 | 67 |
| Sat | 0 | 4 | 13 | 34 | 34 | 27 | 16 | 19 | 23 |
| Sun | 0 | 19 | 13 | 38 | 51 | 36 | 21 | 22 | 38 |
The histogram below helps in understanding the distribution of the response variable. If the histogram is roughly symmetric about the mean, the distribution is approximately normal. However, if the mass is piled up to the left or the right, the distribution is skewed, in which case we can consider taking the log of the response variable when estimating the linear model.
ggplot(train_df,aes(shares)) +
geom_histogram(fill='darkred')
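As a quick follow-up (a sketch, not part of the modelling pipeline), the same histogram on the log scale shows whether the log transform removes the skew:

# Sketch: if the raw histogram above is right-skewed, the log-scale
# histogram should look much more symmetric, supporting a log(shares) model
ggplot(train_df, aes(log(shares))) +
  geom_histogram(fill='darkred')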
# Creating categorical form of response variable
# qcut() from the timereg package bins shares into five quantile groups
train_df$popularity <- qcut(train_df$shares,
                            cuts=5,
                            label=c('Low','Average','Good','High','Very High'))
A bar plot of the total number of shares against days of the week is generated. This graph helps us see on which days the channel gets a high number of shares.
data_plot_1 <- train_df %>%
select(day, shares) %>%
group_by(day) %>%
summarise(Num_Of_Shares = sum(shares))
ggplot(data = data_plot_1, aes(day, Num_Of_Shares)) +
labs(x="Days of Week",y="Number of Shares",title="Number of shares per day") +
geom_col(fill="steelblue",width=0.5)+
theme(plot.title = element_text(hjust = 0.5))
Here, a plot of the number of shares against the number of keywords is generated. This plot helps assess whether increasing the number of keywords in the metadata is associated with a higher number of shares.
data_plot_2 <- train_df %>%
select(shares, num_keywords,popularity)
ggplot(data_plot_2, aes(num_keywords, shares)) +
geom_jitter(aes(col = shares))+
labs(x="Number of keywords in metadata",y="Number of Shares")
Here, a plot of the average number of shares against the number of words in the title is generated. The title is an important part of an article, and it is key to choose an adequate number of words, and the right words, for it. The plot helps us assess whether the number of words in the title drives the number of shares.
data_plot_3 <- train_df %>%
select(n_tokens_title, shares) %>%
group_by(n_tokens_title) %>%
summarise(mean_shares = mean(shares))
ggplot(data_plot_3, aes(n_tokens_title, mean_shares)) +
geom_line() +
labs(x="Number of words in the title",y="Average number of shares")
These plots attempt to find the trend of the log of shares as a function of average positive and average negative polarity. If the number of shares increases or decreases as the polarity of the content increases, we can infer the trend from the slope of the linear regression line on the plots given below.
ggplot(train_df, aes(avg_positive_polarity,log(shares)))+
geom_point(aes(color=popularity)) +
geom_smooth(method="lm",color='black')+
labs(x="Average positive polarity",y="Log Number of shares")
ggplot(train_df, aes(avg_negative_polarity,log(shares)))+
geom_point(aes(color=popularity)) +
geom_smooth(method="lm",color='black')+
labs(x="Average negative polarity",y="Log Number of shares")
This plot is used to identify the highly correlated predictor variables. A big blue dot signifies a pair of variables with high positive correlation, while a big red dot signifies high negative correlation. This plot helps in removing the highly correlated variables from the model, as including them inflates the variability of the predictions.
all_corr = cor(select_if(train_df, is.numeric), method = c("spearman"))
correlated_variables <- findCorrelation(all_corr,cutoff = 0.8,
                                        verbose=FALSE,names=TRUE,exact=TRUE)
corr_data1 <- train_df %>% select(all_of(correlated_variables))
corr1 = cor(corr_data1,method = c("spearman"))
corrplot(corr1,diag=FALSE)
First, the variables that were added as part of the exploratory data analysis are removed. Following that, the highly correlated variables are removed. These are the variables that are highly correlated in all six subsets (based on the data_channel_is_* values).
# Removing variables added for EDA, plus is_weekend
# (redundant with the weekday_is_* dummies)
train_df2 <- train_df %>% select(-day,-image_grouped,-popularity,-is_weekend)
# Removing highly correlated variables
train_df2 <- train_df2 %>% select(-n_non_stop_words,-kw_min_max,
-self_reference_max_shares,-rate_positive_words,
-n_unique_tokens,-global_rate_negative_words,
-kw_min_min,-kw_avg_min,
-self_reference_avg_sharess)
This plot is used to confirm that no highly correlated variables remain in our dataset before we fit the different models.
all_corr2 = cor(select_if(train_df2, is.numeric), method = c("spearman"))
correlated_variables2 <- findCorrelation(all_corr2,cutoff = 0.8,
                                         verbose=FALSE,names=TRUE,exact=TRUE)
if (length(correlated_variables2) > 1) {
  corr_data2 <- train_df2 %>% select(all_of(correlated_variables2))
  corr2 = cor(corr_data2,method = c("spearman"))
  corrplot(corr2,diag=FALSE)
} else {
  print("Not enough variables above the threshold to generate a correlation plot")
}
The linear regression model takes the form $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$. The $x$'s are the independent variables, called predictors, and $y$ is the dependent variable, called the response. The key assumptions of linear regression are:

1. The variance of the error term $\varepsilon$ is constant and its expected value is 0.
2. The error term $\varepsilon$ is normally distributed.
3. The predictor variables are uncorrelated with each other.

Linear regression is a machine learning model with low flexibility but high interpretability, due to which it is a highly preferred method in regulated institutions such as banks. In order to account for non-linear relationships, squared and cubic terms of the predictors are used, and interaction terms of the form $x_1 x_2$ account for dependencies between predictors in predicting the response (see the sketch below). The key aspect of linear regression is to select the group of features that best describes the response. Below, two methods of feature selection are performed.
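As a small illustration of these terms (a sketch only, not one of the four models fitted below; num_hrefs, num_imgs, and num_videos are columns from train_df2 picked purely for the example), R's formula interface expresses squared and interaction terms directly:

# Sketch: non-linear and interaction terms in an lm() formula.
# I() protects the arithmetic inside the formula; ':' adds an interaction.
lm_sketch <- lm(shares ~ num_hrefs + I(num_hrefs^2) + num_imgs:num_videos,
                data = train_df2)
summary(lm_sketch)$coefficients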
In forward subset selection, we start from a constant (intercept-only) model and at each step add the variable that best improves the description of the response. That is, after the constant is determined, the best single variable from the list of p available variables (say x1) is added. Next, the variable from the remaining p-1 variables that works best alongside x1 (say x2) is added, and so on. The best model among the resulting sequence is selected based on AIC, BIC, and adjusted R-squared values, as explained in best subset selection.
set.seed(111)
tr_ctrl <- trainControl(method = "repeatedcv")
lm_forward_fit <- train(shares ~ ., data = train_df2,
preProcess = c("center", "scale"),
method = "leapForward",
tuneGrid = expand.grid(nvmax = seq(1,42,2)),
trControl = tr_ctrl)
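Assuming the fit above runs, caret stores the cross-validated choice of subset size, which can be inspected as follows (a quick check, not required for the pipeline):

# Best subset size chosen by cross-validation, and the RMSE profile
lm_forward_fit$bestTune
plot(lm_forward_fit)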
Backward stepwise selection is the reverse of forward stepwise selection. In this method, all p variables are included to start. Then, at each step, the variable whose removal least harms the fit to the response is dropped. The best model among all is selected based on AIC, BIC, and adjusted R-squared values.
set.seed(111)
tr_ctrl <- trainControl(method = "repeatedcv",number=3)
lm_backward_fit <- train(shares ~ ., data = train_df2,
preProcess = c("center", "scale"),
method = "leapBackward",
tuneGrid = expand.grid(nvmax = seq(1,42,2)),
trControl = tr_ctrl)
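For these leaps-based fits, the coefficients of the selected subset can be pulled from the underlying regsubsets object (again just an inspection sketch):

# Coefficients of the best subset of the cross-validated size
coef(lm_backward_fit$finalModel, id = lm_backward_fit$bestTune$nvmax)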
Random forests extend the idea of bagging, where we take bootstrapped samples, fit a regression tree to each sample, and then average their predictions. In a random forest, instead of using every predictor in each of the trees, we select a random subset of predictors to make sure the trees are not correlated. train_df2 consists of 42 predictor variables, and using all of them makes the model computationally expensive. Hence we have performed PCA to get linear combinations of those variables that account for most of the variability in the dataset. Looking at the elbow point of the scree plot, we decided to go with the first ten principal components for our analysis.

For the random forest we have used cross-validation, and the number of randomly selected predictor variables for each tree is p/3 (since this is a regression problem).
set.seed(111)
# Get the principal components
PC <- prcomp(select(train_df2,-shares),scale = TRUE)
# Scree plot
screeplot(PC,npcs=length(PC$sdev),type = "lines")
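As a complementary check on the scree plot (a sketch; the choice of ten components still comes from the elbow above), the cumulative proportion of variance captured by the first ten components can be read from the prcomp summary:

# Cumulative proportion of variance explained by the first ten PCs
summary(PC)$importance["Cumulative Proportion", 1:10]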
pca_train_data <- as_tibble(predict(PC,select(train_df2,-shares))) %>%
select(PC1:PC10) %>%
bind_cols(select(train_df2,shares))
rfFit <- train(shares ~ ., data = pca_train_data,
               method = "rf",
               trControl = trainControl(method = "repeatedcv",number = 3),
               # mtry = p/3, with p = 10 principal components
               tuneGrid = data.frame(mtry = floor((ncol(pca_train_data)-1)/3)))
Similar to random forest, boosting is an ensemble machine learning method. In this method, residuals are calculated at each step, and in the next step a tree is fit with those residuals as the response variable. The predicted residuals are multiplied by the learning rate and added to the running prediction, so the model improves slowly, which helps avoid overfitting.
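To make that update concrete, here is a minimal hand-rolled sketch of boosting with depth-one trees (illustrative only; the actual model below is fitted with caret's gbm method, and the learning rate lambda, the 50 iterations, and the use of rpart are assumptions of the sketch):

library(rpart)
lambda <- 0.1                                          # learning rate
f_hat <- rep(mean(train_df2$shares), nrow(train_df2))  # start from the mean
sketch_df <- train_df2
for (b in 1:50) {
  sketch_df$r <- train_df2$shares - f_hat              # current residuals
  stump <- rpart(r ~ . - shares, data = sketch_df, maxdepth = 1)
  f_hat <- f_hat + lambda * predict(stump, sketch_df)  # shrunken update
}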
set.seed(111)
boost_fit <- train(shares ~ .,data = train_df2,
method = 'gbm',
preProcess = c("center", "scale"),
trControl = tr_ctrl,
verbose = FALSE)
For each of the models fitted above on the training data, predictions are made on the test data set. The RMSE on the test data is then calculated and compared across all four models. For comparison, a data frame of each modelling method against its RMSE value is built, and the one with the lowest RMSE value is selected as the winner.
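The metric is RMSE = sqrt(mean((prediction - observed)^2)); the chunks below inline this computation, though it could equally be wrapped in a small helper (a sketch; the name rmse is ours):

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
# e.g. rmse(lm_forward_pred, test_df$shares)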
# Forward Selection
lm_forward_pred <- predict(lm_forward_fit,newdata=select(test_df,-shares))
forward_rmse <- sqrt(mean((lm_forward_pred-test_df$shares)^2))
# Backward Selection
lm_backward_pred <- predict(lm_backward_fit,newdata=select(test_df,-shares))
backward_rmse <- sqrt(mean((lm_backward_pred-test_df$shares)^2))
#Random Forest
pca_test_data <- as_tibble(predict(PC,select(test_df,-shares))) %>%
select(PC1:PC10)
rfPred <- predict(rfFit, newdata = pca_test_data,type = "raw")
rf_rmse <- sqrt(mean((rfPred-test_df$shares)^2))
# Boosted Tree
boosted_pred <- predict(boost_fit,newdata=select(test_df,-shares))
boosted_rmse <- sqrt(mean((boosted_pred-test_df$shares)^2))
model_method <- c("Forward Selection", "Backward Selection", "Random Forest", "Boosted Tree")
model_rmse <- c(forward_rmse, backward_rmse, rf_rmse, boosted_rmse)
model_result <- data.frame(model_method, model_rmse)
model_result
winner <- paste0("The best model is ", model_result[which.min(model_result$model_rmse), 1],
                 " as it has the lowest RMSE value of ", round(min(model_result$model_rmse), 2))
winner
## [1] "The best model is Backward Selection as it has the lowest RMSE value of 5729.99"