
Project 3: Predicting online news popularity

Rohan Prabhune, Manan Shah

Introduction

The data set describes features (token details, keywords, day of the week, etc.) for each record and a particular channel. The purpose of the analysis is to predict the performance of each individual channel, measured by the number of shares, which in the modelling context is the response variable. The url of each record and timedelta (the number of days between article publication and dataset acquisition) are removed, as they are not useful in predicting shares. A few variables, namely n_non_stop_words, kw_min_max, self_reference_max_shares, rate_positive_words, n_unique_tokens, global_rate_negative_words, kw_min_min, kw_avg_min, and self_reference_avg_sharess, are highly correlated in all six channels and are removed before applying any modelling technique. Finally, four modelling methodologies are fitted on the training dataset: forward subset linear regression, backward subset linear regression, and the ensemble techniques random forest and boosting. To evaluate model fit, predictions are made on the test dataset and an RMSE value is calculated. These RMSE values are compared across all four models, and the one with the lowest RMSE is declared the winner for a particular channel.

Loading the necessary packages

library(tidyverse)
library(caret)
library(kableExtra)
library(corrplot)
library(timereg)

Reading the data:

Citation for the dataset:
K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.

# Reading the data
dataset <- read_csv("data/OnlineNewsPopularity.csv",show_col_types=FALSE) 
dataset

Pre-processing:

Subset data based on type of article

Here, remaining_cols stores the column names other than the data_channel_is_* columns. The column selected for the requested channel type is appended to these, giving subset_cols. The dataset is then restricted to the columns in subset_cols and filtered to the rows where the selected data_channel_is_* column equals 1.

# Vector of columns other than data_channel_is*
remaining_cols = c(names(dataset))[c(1:13,20:61)]
# Subset based on type of article
input_data_channel = params$var
select_var = paste0("data_channel_is_",input_data_channel)
subset_cols = append(remaining_cols,select_var)
# Subset data
df <- dataset %>% select(all_of(subset_cols)) %>%
  filter(dataset[select_var] == 1) %>%
  select(all_of(remaining_cols))

Removing url and timedelta, as they are non-predictive variables, and filtering the tibble on n_tokens_content and n_tokens_title, that is, keeping the rows where the number of words in the content and in the title are both non-zero.

# Remove url and timedelta and filter on data
df <- df %>% select(-url,-timedelta) %>% filter(n_tokens_content!=0,n_tokens_title!=0)
df
# Column names
names(df)
##  [1] "n_tokens_title"               "n_tokens_content"            
##  [3] "n_unique_tokens"              "n_non_stop_words"            
##  [5] "n_non_stop_unique_tokens"     "num_hrefs"                   
##  [7] "num_self_hrefs"               "num_imgs"                    
##  [9] "num_videos"                   "average_token_length"        
## [11] "num_keywords"                 "kw_min_min"                  
## [13] "kw_max_min"                   "kw_avg_min"                  
## [15] "kw_min_max"                   "kw_max_max"                  
## [17] "kw_avg_max"                   "kw_min_avg"                  
## [19] "kw_max_avg"                   "kw_avg_avg"                  
## [21] "self_reference_min_shares"    "self_reference_max_shares"   
## [23] "self_reference_avg_sharess"   "weekday_is_monday"           
## [25] "weekday_is_tuesday"           "weekday_is_wednesday"        
## [27] "weekday_is_thursday"          "weekday_is_friday"           
## [29] "weekday_is_saturday"          "weekday_is_sunday"           
## [31] "is_weekend"                   "LDA_00"                      
## [33] "LDA_01"                       "LDA_02"                      
## [35] "LDA_03"                       "LDA_04"                      
## [37] "global_subjectivity"          "global_sentiment_polarity"   
## [39] "global_rate_positive_words"   "global_rate_negative_words"  
## [41] "rate_positive_words"          "rate_negative_words"         
## [43] "avg_positive_polarity"        "min_positive_polarity"       
## [45] "max_positive_polarity"        "avg_negative_polarity"       
## [47] "min_negative_polarity"        "max_negative_polarity"       
## [49] "title_subjectivity"           "title_sentiment_polarity"    
## [51] "abs_title_subjectivity"       "abs_title_sentiment_polarity"
## [53] "shares"

Splitting the data into train and test sets

The entire data set is divided into a training and a testing set. This is done in order to perform exploratory data analysis and modelling on the training data and later evaluate performance on the unseen test data.

set.seed(52)
# Sample the row indexes for the training data (70% of rows)
train_index <- sample(nrow(df), nrow(df)*0.7)
# Get training data
train_df <- df[train_index,]
# Get test data
test_df <- df[-train_index,]
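
As an aside, caret offers a stratified alternative to the simple random split above; a minimal sketch using createDataPartition, which samples within quantiles of the outcome (not used in the analysis that follows):

# Hedged alternative: a train/test split stratified on the response
train_index_strat <- as.vector(createDataPartition(df$shares, p = 0.7, list = FALSE))
train_df_strat <- df[train_index_strat, ]
test_df_strat  <- df[-train_index_strat, ]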

Summarizations

Summary tables

Here, the column wise summary of important variables such as number of images, number of videos, number of unique tokens etc. is generated for the train data. This helps us evaluate the distribution of the variables.

# Creating summaries for some numerical variables
df_summary <- train_df %>%
  select(num_imgs, num_videos, num_hrefs, n_unique_tokens, num_keywords,
         global_rate_positive_words, global_rate_negative_words)
predictor_table <- apply(df_summary, MARGIN = 2,FUN = summary, na.rm = TRUE)
predictor_table %>%
  kbl(caption="Summary table for predictor variables") %>%
  kable_classic(full_width = F)
Summary table for predictor variables

|         | num_imgs | num_videos | num_hrefs | n_unique_tokens | num_keywords | global_rate_positive_words | global_rate_negative_words |
|---------|---------:|-----------:|----------:|----------------:|-------------:|---------------------------:|---------------------------:|
| Min.    |  0.00000 |  0.0000000 |   0.00000 |       0.1872083 |     2.000000 |                  0.0000000 |                  0.0000000 |
| 1st Qu. |  1.00000 |  0.0000000 |   5.00000 |       0.4708171 |     6.000000 |                  0.0227704 |                  0.0114286 |
| Median  |  1.00000 |  0.0000000 |   8.00000 |       0.5233415 |     7.000000 |                  0.0310651 |                  0.0165746 |
| Mean    |  2.67413 |  0.5550114 |  10.59297 |       0.5266837 |     7.278993 |                  0.0322662 |                  0.0173682 |
| 3rd Qu. |  2.00000 |  1.0000000 |  13.00000 |       0.5764331 |     9.000000 |                  0.0398936 |                  0.0222635 |
| Max.    | 79.00000 | 51.0000000 | 161.00000 |       0.9761905 |    10.000000 |                  0.1127273 |                  0.0750361 |

This table helps us understand the numerical summary for the response variable. From this we can find out minimum, maximum, median and quantile values for the response variable.

response_table <- as.array(summary(train_df$shares))
response_table %>%
  kbl(caption="Summary table for response variable") %>%
  kable_classic(full_width = F)
Summary table for response variable

| Statistic | shares     |
|-----------|-----------:|
| Min.      |     35.000 |
| 1st Qu.   |    827.000 |
| Median    |   1100.000 |
| Mean      |   2270.815 |
| 3rd Qu.   |   1800.000 |
| Max.      | 284700.000 |

Contingency tables

The training data already has dummy variables generated. The weekday_is_* variables are collapsed into a single day variable, and a two-way contingency table of day vs. number of keywords is generated. The number of images is converted from a quantitative variable into a categorical one, and a two-way contingency table of day vs. the grouped image counts is created.

# Create a categorical "day" column
train_df$day <- ifelse(train_df$weekday_is_monday == 1, "Mon",
                ifelse(train_df$weekday_is_tuesday == 1, "Tues",
                ifelse(train_df$weekday_is_wednesday == 1, "Wed", 
                ifelse(train_df$weekday_is_thursday == 1, "Thurs",
                ifelse(train_df$weekday_is_friday == 1, "Fri",
                ifelse(train_df$weekday_is_saturday == 1, "Sat",
                ifelse(train_df$weekday_is_sunday == 1, "Sun","NA")))))))
train_df$day <- ordered(as.factor(train_df$day),
                        levels = c("Mon","Tues","Wed","Thurs","Fri","Sat","Sun"))

# Create a categorical "image_grouped" column
train_df$image_grouped <- ifelse(train_df$num_imgs %in% c(0:25), "0-25",
                          ifelse(train_df$num_imgs %in% c(26:50), "26-50",
                          ifelse(train_df$num_imgs %in% c(51:75), "51-75",
                          ifelse(train_df$num_imgs %in% c(76:100), "76-100","100+"))))
train_df$image_grouped <- ordered(as.factor(train_df$image_grouped),
                        levels = c("0-25","26-50","51-75","76-100","100+"))
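
# Aside: the nested ifelse() chains above can be written more readably with
# dplyr::case_when(); a minimal equivalent sketch for the "day" column
# (re-running it yields the same values as the ifelse() version):
train_df <- train_df %>%
  mutate(day = ordered(case_when(
    weekday_is_monday    == 1 ~ "Mon",
    weekday_is_tuesday   == 1 ~ "Tues",
    weekday_is_wednesday == 1 ~ "Wed",
    weekday_is_thursday  == 1 ~ "Thurs",
    weekday_is_friday    == 1 ~ "Fri",
    weekday_is_saturday  == 1 ~ "Sat",
    weekday_is_sunday    == 1 ~ "Sun"
  ), levels = c("Mon","Tues","Wed","Thurs","Fri","Sat","Sun")))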

# Contingency table between day and image_grouped
contingency_1 <- table(train_df$day,train_df$image_grouped)
contingency_1 %>%
  kbl(caption="Table for Days and Number of Images") %>%
  kable_classic(full_width = F)
Table for Days and Number of Images

|       | 0-25 | 26-50 | 51-75 | 76-100 | 100+ |
|-------|-----:|------:|------:|-------:|-----:|
| Mon   |  872 |     3 |     1 |      0 |    0 |
| Tues  | 1084 |     8 |     0 |      1 |    0 |
| Wed   | 1060 |    11 |     3 |      0 |    0 |
| Thurs | 1057 |     5 |     1 |      0 |    0 |
| Fri   |  877 |    14 |     4 |      0 |    0 |
| Sat   |  351 |     2 |     1 |      0 |    0 |
| Sun   |  359 |     3 |     0 |      0 |    0 |
# Contingency table between day and num_keywords
contingency_2 <- table(as.factor(train_df$day), as.factor(train_df$num_keywords))
contingency_2 %>%
  kbl(caption="Table for Days and Number of Keywords") %>%
  kable_classic(full_width = F)
Table for Days and Number of Keywords

|       |  2 |  3 |  4 |   5 |   6 |   7 |   8 |   9 |  10 |
|-------|---:|---:|---:|----:|----:|----:|----:|----:|----:|
| Mon   |  0 |  7 | 40 | 120 | 170 | 174 | 119 |  93 | 153 |
| Tues  |  1 | 14 | 61 | 118 | 190 | 200 | 170 | 143 | 196 |
| Wed   |  0 | 15 | 68 | 127 | 180 | 191 | 172 | 138 | 183 |
| Thurs |  0 | 17 | 67 | 124 | 184 | 195 | 132 | 139 | 205 |
| Fri   |  0 | 13 | 41 | 102 | 155 | 171 | 143 | 128 | 142 |
| Sat   |  0 | 13 | 31 |  51 |  50 |  49 |  48 |  35 |  77 |
| Sun   |  0 |  6 | 26 |  54 |  62 |  60 |  53 |  35 |  66 |

Plots

Response variable analysis

The histogram below helps in understanding the distribution of the response variable. If the histogram is roughly symmetric around the mean, we can treat the distribution as approximately normal. If it leans to the left or the right, the distribution is skewed, in which case we can consider the log of the response variable when estimating the linear model.

ggplot(train_df,aes(shares)) + 
  geom_histogram(fill='darkred')

# Creating categorical form of response variable
train_df$popularity <-qcut(train_df$shares,
                           cuts=5,
                           label=c('Low','Average','Good','High','Very High'))
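
If the raw histogram above is right-skewed, a quick check on whether the log transform helps is to plot log(shares) as well; a minimal sketch:

# Histogram of log(shares); a roughly symmetric shape here supports
# modelling log(shares) instead of raw shares
ggplot(train_df, aes(log(shares))) +
  geom_histogram(fill = "darkred", bins = 30) +
  labs(x = "log(shares)", y = "Count")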

Number of shares per day

A bar plot of the number of shares vs. the days of the week is generated. This graph helps us see on which days there is a high number of shares for a particular channel.

data_plot_1 <- train_df %>% 
  select(day, shares) %>% 
  group_by(day) %>% 
  summarise(Num_Of_Shares = sum(shares))

ggplot(data = data_plot_1, aes(day, Num_Of_Shares)) + 
  labs(x="Days of Week",y="Number OF Shares",title="Number of shares per day") + 
  geom_col(fill="steelblue",width=0.5)+
  theme(plot.title = element_text(hjust = 0.5))

Number of shares vs number of keywords in metadata

Here, a plot of the number of shares vs. the number of keywords is generated. This plot helps answer whether increasing the number of keywords in the metadata increases the number of shares.

data_plot_2 <- train_df %>% 
  select(shares, num_keywords,popularity)

ggplot(data_plot_2, aes(num_keywords, shares)) + 
  geom_point() + 
  geom_jitter(aes(col = shares))+
  labs(x="Number of keywords in metadata",y="Number OF Shares")

Average number of shares per words in title

Here, a plot of the average number of shares vs. the number of words in the title is generated. The title is an important part of an article, and it is key to have an adequate number of well-chosen words in it. The plot helps us conclude whether the number of words in the title drives the number of shares.

data_plot_3 <- train_df %>% 
  select(n_tokens_title, shares) %>% 
  group_by(n_tokens_title) %>% 
  summarise(mean_token_title = mean(shares))

ggplot(data_plot_3, aes(n_tokens_title, mean_token_title)) + 
  geom_line() +
  labs(x="Number of words in the title",y="Average number of shares")

Sentiment plots

These plots attempt to find the trend of the log of shares as a function of average positive and average negative polarity. If the number of shares increases or decreases as the polarity of the content increases, we can infer the trend from the slope of the linear regression line in the plots given below.

ggplot(train_df, aes(avg_positive_polarity,log(shares)))+ 
  geom_point(aes(color=popularity)) + 
  geom_smooth(method="lm",color='black')+
  labs(x="Average positive polarity",y="Log Number of shares")

ggplot(train_df, aes(avg_negative_polarity,log(shares)))+ 
  geom_point(aes(color=popularity)) + 
  geom_smooth(method="lm",color='black')+
  labs(x="Average negative polarity",y="Log Number of shares")

Correlation plot

This plot is used to identify the highly correlated predictor variables. A big blue dot signifies a pair of variables with high positive correlation, while a big red dot signifies high negative correlation. This plot helps in removing the highly correlated variables from the model, as including them would increase the variability in prediction.

all_corr = cor(select_if(train_df, is.numeric), method = c("spearman"))
correlated_variables <- findCorrelation(all_corr, cutoff = 0.8,
                                        verbose = FALSE, names = TRUE, exact = TRUE)

corr_data1 <- train_df %>% select(all_of(correlated_variables))
corr1 = cor(corr_data1, method = c("spearman"))
corrplot(corr1, diag = FALSE)

Modeling

Variable selection

First, the variables which were added as part of the exploratory data analysis are removed. Following that, the highly correlated variables are removed; these are the variables that are highly correlated in all six subsets (based on the data_channel_is_* values).

# Removing newly added variables for EDA
train_df2 <- train_df %>% select(-day,-image_grouped,-popularity,-is_weekend)
# Removing highly correlated variables
train_df2 <- train_df2 %>% select(-n_non_stop_words,-kw_min_max,
                                  -self_reference_max_shares,-rate_positive_words,
                                  -n_unique_tokens,-global_rate_negative_words,
                                  -kw_min_min,-kw_avg_min,
                                  -self_reference_avg_sharess)

Correlation plot after removing highly correlated variables

This plot is used to confirm if we no longer have highly correlated variables in our dataset before we fit different models on the data.

all_corr2 = cor(select_if(train_df2, is.numeric), method = c("spearman"))
correlated_variables2 <- findCorrelation(all_corr2, cutoff = 0.8,
                                         verbose = FALSE, names = TRUE, exact = TRUE)

if (length(correlated_variables2) > 1) {
  corr_data2 <- train_df2 %>% select(all_of(correlated_variables2))
  corr2 = cor(corr_data2, method = c("spearman"))
  corrplot(corr2, diag = FALSE)
} else {
  print("Not enough variables above threshold value to generate a correlation plot")
}

Linear regression

The linear regression model takes the form $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$. The $x$'s are the independent variables, called predictors, and $y$ is the dependent variable, called the response. The key assumptions of linear regression are:

1. The variance of the error term ($\epsilon$) is constant and its expected value is 0.
2. The error term ($\epsilon$) is normally distributed.
3. The predictor variables are uncorrelated with each other.

Linear regression is a machine learning model with low flexibility but high interpretability, due to which it is a highly preferred method in regulated institutions such as banks. In order to account for non-linear relationships, square and cubic terms of the predictors can be used, and interaction terms of the form $x_1 x_2$ account for predictors acting jointly on the response. The key aspect of linear regression is selecting the group of features that best describes the response; two methods of feature selection are applied below.
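
For illustration, here is how square, cubic, and interaction terms look in an R model formula; this is a hedged sketch using two predictors from the data (such terms are not included in the models fitted below):

# I() protects arithmetic inside a formula; x1:x2 denotes the interaction
example_fit <- lm(shares ~ num_imgs + I(num_imgs^2) + I(num_imgs^3) +
                    num_videos + num_imgs:num_videos,
                  data = train_df2)
summary(example_fit)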

Forward Selection

In forward stepwise selection, we start from a model containing only a constant and add, one at a time, the variable that best describes the response. That is, after the constant is determined, the single best variable from the available list of $p$ variables (say $x_1$) is added. Next, the variable from the remaining $p-1$ variables that works best alongside $x_1$ (say $x_2$) is added, and so on. The best model among the candidates is selected based on criteria such as AIC, BIC, and adjusted $R^2$, as in best subset selection.

set.seed(111)

tr_ctrl <- trainControl(method = "repeatedcv")
lm_forward_fit <- train(shares ~ ., data = train_df2,
                        preProcess = c("center", "scale"),
                        method = "leapForward",
                        tuneGrid = expand.grid(nvmax = seq(1,42,2)),
                        trControl = tr_ctrl)

Backward Selection

Backward stepwise selection is the reverse of forward stepwise selection. In this method, all $p$ variables are included at the start. Then, at each step, the variable contributing least to the fit is removed. The best model among the candidates is selected based on criteria such as AIC, BIC, and adjusted $R^2$.

set.seed(111)

tr_ctrl <- trainControl(method = "repeatedcv",number=3)
lm_backward_fit <- train(shares ~ ., data = train_df2,
                        preProcess = c("center", "scale"),
                        method = "leapBackward",
                        tuneGrid = expand.grid(nvmax = seq(1,42,2)),
                        trControl = tr_ctrl)

Ensemble Tree-Based Models

Random Forest

Random forests extend the idea of bagging, where we take bootstrapped samples, fit a regression tree to each sample, and then average their predictions. In a random forest, instead of using every predictor in each tree, we select a random subset of predictors to make sure the trees are not correlated. train_df2 contains 42 predictor variables, and using all of them makes the model computationally expensive. Hence we perform PCA to get linear combinations of those variables that account for most of the variability in the dataset. Looking at the elbow point of the scree plot, we decided to go with the first ten principal components for our analysis.
For the random forest we use cross-validation, and the number of randomly selected predictor variables for each tree is $p/3$ (since it is a regression problem).

PCA

set.seed(111)
# Get the principal components
PC <- prcomp(select(train_df2,-shares),scale = TRUE)
#Screenplot
screeplot(PC,npcs=length(PC$sdev),type = "lines")
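
The elbow read off the scree plot can be cross-checked against the cumulative proportion of variance explained; a minimal sketch:

# Proportion of total variance captured by the first ten components
var_explained <- PC$sdev^2 / sum(PC$sdev^2)
round(cumsum(var_explained)[1:10], 3)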

Training

pca_train_data <- as_tibble(predict(PC,select(train_df2,-shares))) %>%
  select(PC1:PC10) %>% 
  bind_cols(select(train_df2,shares))

rfFit <- train(shares ~ ., data = pca_train_data,
               method = "rf",
               trControl = trainControl(method = "repeatedcv", number = 3),
               # mtry = p/3; subtract 1 column so the response is not counted
               tuneGrid = data.frame(mtry = floor((ncol(pca_train_data) - 1)/3)))

Boosting

Similar to random forest, boosting is an ensemble machine learning method. In this method, the residuals are calculated at each step, and in the next step a tree is fit with those residuals as the response. The predicted residual $\hat{r}$ is multiplied by the learning rate $\alpha$ and added to the current prediction, so the update takes the form $\hat{y} \leftarrow \hat{y} + \alpha \, \hat{r}$, starting from a constant such as the mean response. The $\alpha$ value is kept low; without this shrinkage the model would overfit. The process is repeated a fixed number of times, and the final prediction is the accumulated sum.

set.seed(111)
boost_fit <- train(shares ~ .,data = train_df2,
                   method = 'gbm',
                   preProcess = c("center", "scale"),
                   trControl = tr_ctrl,
                   verbose = FALSE)
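
By default caret tunes gbm over a small built-in grid. The learning rate $\alpha$ discussed above corresponds to gbm's shrinkage parameter and can be controlled explicitly with a tuneGrid; a hedged sketch with illustrative values (not the settings used for the results below):

# Illustrative tuning grid; shrinkage is the learning rate alpha
gbm_grid <- expand.grid(n.trees = c(50, 100, 150),
                        interaction.depth = c(1, 2, 3),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = 10)
boost_fit_tuned <- train(shares ~ ., data = train_df2,
                         method = "gbm",
                         preProcess = c("center", "scale"),
                         trControl = tr_ctrl,
                         tuneGrid = gbm_grid,
                         verbose = FALSE)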

Model evaluation on test set

For each of the models fitted above on the training data, predictions are made on the test data set using the trained model for each method. An RMSE value on the testing data is then calculated and compared across all four models by assembling a data frame of each modelling method against its RMSE value; the one with the lowest RMSE is selected as the winner.
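
Since the same RMSE computation repeats for every model, it could be factored into a small helper; a minimal sketch (the blocks below keep the explicit form):

# Helper: root mean squared error between predictions and observed values
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
# e.g. forward_rmse <- rmse(lm_forward_pred, test_df$shares)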

Forward selection

# Forward Selection
lm_forward_pred <- predict(lm_forward_fit,newdata=select(test_df,-shares))
forward_rmse <- sqrt(mean((lm_forward_pred-test_df$shares)^2))

Backward Selection

# Backward Selection
lm_backward_pred <- predict(lm_backward_fit,newdata=select(test_df,-shares))
backward_rmse <- sqrt(mean((lm_backward_pred-test_df$shares)^2))

Random Forest

#Random Forest
pca_test_data <- as_tibble(predict(PC,select(test_df,-shares))) %>%
  select(PC1:PC10)

rfPred <- predict(rfFit, newdata = pca_test_data,type = "raw")
rf_rmse <- sqrt(mean((rfPred-test_df$shares)^2))

Boosted Tree

# Boosted Tree
boosted_pred <- predict(boost_fit,newdata=select(test_df,-shares))
boosted_rmse <- sqrt(mean((boosted_pred-test_df$shares)^2))

Model evaluation

model_method <- c("Forward Selection", "Backward Selection", "Random Forest", "Boosted Tree")
model_rmse <- c(forward_rmse, backward_rmse, rf_rmse, boosted_rmse)

model_result <- data.frame(model_method, model_rmse)
model_result

winner <- paste("The winner that is best model among all is ", model_result[which.min(model_result$model_rmse), 1], " as it has lowest RMSE value of ", round(min(model_result$model_rmse), 2))
winner
## [1] "The winner that is best model among all is  Backward Selection  as it has lowest RMSE value of  4667.49"