Rohan Prabhune, Manan Shah
- Introduction
- Loading the necessary packages
- Reading the data
- Pre-processing
- Summarizations
- Modeling
- Model evaluation on test set
The data set describes features (token details, keywords, day of the week, etc.) for each record within a particular channel. The purpose of the analysis is to predict the performance of each individual channel, measured by the number of shares, which in the modelling context is the response variable. The url of each record and timedelta (the number of days between article publication and dataset acquisition) are removed, as they are not useful in predicting shares. A few variables, namely n_non_stop_words, kw_min_max, self_reference_max_shares, rate_positive_words, n_unique_tokens, global_rate_negative_words, kw_min_min, kw_avg_min, and self_reference_avg_sharess, which are highly correlated in all six channels, are removed before applying any modelling technique. Finally, four modelling methodologies are fitted on the training dataset: forward subset linear regression, backward subset linear regression, and two ensemble techniques, random forest and boosting. To evaluate each model fit, predictions are made on the test dataset and the RMSE value is calculated. These RMSE values are compared across all four models, and the one with the lowest RMSE is declared the winner for a particular channel.
library(tidyverse)   # data wrangling and plotting
library(caret)       # model training and evaluation
library(kableExtra)  # formatted tables
library(corrplot)    # correlation plots
library(timereg)     # qcut() for quantile binning
Citation for the dataset:
K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision
Support System for Predicting the Popularity of Online News. Proceedings
of the 17th EPIA 2015 - Portuguese Conference on Artificial
Intelligence, September, Coimbra, Portugal.
# Reading the data
dataset <- read_csv("data/OnlineNewsPopularity.csv",show_col_types=FALSE)
dataset
Here, remaining_cols is used to store the column names other than the data_channel_is_* columns. The column corresponding to the selected type of channel is then appended to it, giving subset_cols. The dataset is filtered to the columns in subset_cols and to the rows where the selected data_channel_is_* column equals 1.
# Vector of columns other than data_channel_is*
remaining_cols <- names(dataset)[c(1:13,20:61)]
# Subset based on type of article
input_data_channel = params$var
select_var = paste0("data_channel_is_",input_data_channel)
subset_cols = append(remaining_cols,select_var)
# Subset data
df <- dataset %>% select(all_of(subset_cols)) %>%
  filter(.data[[select_var]] == 1) %>%
  select(all_of(remaining_cols))
We remove url and timedelta as they are non-predictive variables, and filter the tibble on n_tokens_content and n_tokens_title, that is, we keep only rows where the number of words in the content and in the title are both nonzero.
# Remove url and timedelta and filter on data
df <- df %>% select(-url,-timedelta) %>% filter(n_tokens_content!=0,n_tokens_title!=0)
df
# Column names
names(df)
## [1] "n_tokens_title" "n_tokens_content"
## [3] "n_unique_tokens" "n_non_stop_words"
## [5] "n_non_stop_unique_tokens" "num_hrefs"
## [7] "num_self_hrefs" "num_imgs"
## [9] "num_videos" "average_token_length"
## [11] "num_keywords" "kw_min_min"
## [13] "kw_max_min" "kw_avg_min"
## [15] "kw_min_max" "kw_max_max"
## [17] "kw_avg_max" "kw_min_avg"
## [19] "kw_max_avg" "kw_avg_avg"
## [21] "self_reference_min_shares" "self_reference_max_shares"
## [23] "self_reference_avg_sharess" "weekday_is_monday"
## [25] "weekday_is_tuesday" "weekday_is_wednesday"
## [27] "weekday_is_thursday" "weekday_is_friday"
## [29] "weekday_is_saturday" "weekday_is_sunday"
## [31] "is_weekend" "LDA_00"
## [33] "LDA_01" "LDA_02"
## [35] "LDA_03" "LDA_04"
## [37] "global_subjectivity" "global_sentiment_polarity"
## [39] "global_rate_positive_words" "global_rate_negative_words"
## [41] "rate_positive_words" "rate_negative_words"
## [43] "avg_positive_polarity" "min_positive_polarity"
## [45] "max_positive_polarity" "avg_negative_polarity"
## [47] "min_negative_polarity" "max_negative_polarity"
## [49] "title_subjectivity" "title_sentiment_polarity"
## [51] "abs_title_subjectivity" "abs_title_sentiment_polarity"
## [53] "shares"
The entire data set is divided into training and testing data sets. This is done in order to perform exploratory data analysis and modelling on the training data, and later evaluate performance on the unseen test data.
set.seed(52)
# Get the row indexes for the training data
train_index <- sample(nrow(df), nrow(df)*0.7)
# Get training data
train_df <- df[train_index,]
# Get test data
test_df <- df[-train_index,]
Here, a column-wise summary of important variables such as the number of images, number of videos, number of unique tokens, etc. is generated for the training data. This helps us evaluate the distribution of these variables.
# Creating summaries for some numerical variables
df_summary <- train_df %>%
  select(num_imgs,num_videos,num_hrefs,n_unique_tokens,num_keywords,
         global_rate_positive_words,global_rate_negative_words)
predictor_table <- apply(df_summary, MARGIN = 2,FUN = summary, na.rm = TRUE)
predictor_table %>%
kbl(caption="Summary table for predictor variables") %>%
kable_classic(full_width = F)
| | num_imgs | num_videos | num_hrefs | n_unique_tokens | num_keywords | global_rate_positive_words | global_rate_negative_words |
|---|---|---|---|---|---|---|---|
| Min. | 0.000000 | 0.000000 | 0.000000 | 0.2609953 | 2.000000 | 0.0000000 | 0.0000000 |
| 1st Qu. | 1.000000 | 0.000000 | 4.000000 | 0.4785764 | 5.000000 | 0.0327766 | 0.0090909 |
| Median | 1.000000 | 0.000000 | 7.000000 | 0.5471194 | 6.000000 | 0.0428571 | 0.0139860 |
| Mean | 1.826535 | 0.673923 | 9.430339 | 0.5475176 | 6.484418 | 0.0436086 | 0.0147766 |
| 3rd Qu. | 1.000000 | 0.000000 | 12.000000 | 0.6115312 | 8.000000 | 0.0538922 | 0.0193705 |
| Max. | 51.000000 | 75.000000 | 97.000000 | 0.8732394 | 10.000000 | 0.1250000 | 0.0588235 |
This table provides the numerical summary of the response variable. From it we can read off the minimum, maximum, median, mean, and quartile values of shares.
response_table <- as.array(summary(train_df$shares))
response_table %>%
kbl(caption="Summary table for response variable") %>%
kable_classic(full_width = F)
| Statistic | Shares |
|---|---|
| Min. | 1.000 |
| 1st Qu. | 954.000 |
| Median | 1400.000 |
| Mean | 3245.892 |
| 3rd Qu. | 2500.000 |
| Max. | 690400.000 |
The training data already has dummy variables generated. The weekday_is_* variables are grouped into a single day variable, and the number of images is converted from a quantitative variable into a categorical one. Two-way contingency tables are then created: day vs. the grouped image counts, and day vs. number of keywords.
# Create a categorical "day" column
train_df$day <- ifelse(train_df$weekday_is_monday == 1, "Mon",
ifelse(train_df$weekday_is_tuesday == 1, "Tues",
ifelse(train_df$weekday_is_wednesday == 1, "Wed",
ifelse(train_df$weekday_is_thursday == 1, "Thurs",
ifelse(train_df$weekday_is_friday == 1, "Fri",
ifelse(train_df$weekday_is_saturday == 1, "Sat",
ifelse(train_df$weekday_is_sunday == 1, "Sun","NA")))))))
train_df$day <- ordered(as.factor(train_df$day),
levels = c("Mon","Tues","Wed","Thurs","Fri","Sat","Sun"))
# Create a categorical "image_grouped" column
train_df$image_grouped <- ifelse(train_df$num_imgs %in% c(0:25), "0-25",
                          ifelse(train_df$num_imgs %in% c(26:50), "26-50",
                          ifelse(train_df$num_imgs %in% c(51:75), "51-75",
                          ifelse(train_df$num_imgs %in% c(76:100), "76-100","100+"))))
train_df$image_grouped <- ordered(as.factor(train_df$image_grouped),
levels = c("0-25","26-50","51-75","76-100","100+"))
# Contingency table between day and image_grouped
contingency_1 <- table(train_df$day,train_df$image_grouped)
contingency_1 %>%
kbl(caption="Table for Days and Number of Images") %>%
kable_classic(full_width = F)
| | 0-25 | 26-50 | 51-75 | 76-100 | 100+ |
|---|---|---|---|---|---|
| Mon | 796 | 5 | 0 | 0 | 0 |
| Tues | 816 | 4 | 0 | 0 | 0 |
| Wed | 871 | 3 | 0 | 0 | 0 |
| Thurs | 875 | 2 | 0 | 0 | 0 |
| Fri | 582 | 2 | 0 | 0 | 0 |
| Sat | 169 | 1 | 0 | 0 | 0 |
| Sun | 237 | 0 | 1 | 0 | 0 |
# Contingency table between day and num_keywords
contingency_2 <- table(as.factor(train_df$day), as.factor(train_df$num_keywords))
contingency_2 %>%
kbl(caption="Table for Days and Number of Keywords") %>%
kable_classic(full_width = F)
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| Mon | 2 | 22 | 91 | 161 | 180 | 134 | 80 | 57 | 74 |
| Tues | 4 | 22 | 107 | 155 | 168 | 113 | 98 | 65 | 88 |
| Wed | 3 | 36 | 111 | 177 | 181 | 122 | 92 | 62 | 90 |
| Thurs | 2 | 48 | 112 | 172 | 160 | 123 | 94 | 56 | 110 |
| Fri | 1 | 18 | 68 | 117 | 102 | 83 | 74 | 54 | 67 |
| Sat | 0 | 4 | 13 | 34 | 34 | 27 | 16 | 19 | 23 |
| Sun | 0 | 19 | 13 | 38 | 51 | 36 | 21 | 22 | 38 |
The histogram below helps in understanding the distribution of the response variable. If the histogram is roughly symmetric about the mean, the distribution is approximately normal. However, if the mass is piled up to the left or the right, the distribution is skewed, in which case we can consider taking the log of the response variable when estimating the linear model.
ggplot(train_df,aes(shares)) +
geom_histogram(fill='darkred')
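As a quick follow-up (a sketch, not part of the modelling pipeline), the same histogram on the log scale shows whether the log transform removes the skew:

# Sketch: if the raw histogram above is right-skewed, the log-scale
# histogram should look much more symmetric, supporting a log(shares) model
ggplot(train_df, aes(log(shares))) +
  geom_histogram(fill='darkred')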
# Creating categorical form of response variable
# qcut() from the timereg package bins shares into five quantile groups
train_df$popularity <- qcut(train_df$shares,
                            cuts=5,
                            label=c('Low','Average','Good','High','Very High'))
A bar plot of the total number of shares against days of the week is generated. This graph helps us see on which days the channel gets a high number of shares.
data_plot_1 <- train_df %>%
select(day, shares) %>%
group_by(day) %>%
summarise(Num_Of_Shares = sum(shares))
ggplot(data = data_plot_1, aes(day, Num_Of_Shares)) +
labs(x="Days of Week",y="Number of Shares",title="Number of shares per day") +
geom_col(fill="steelblue",width=0.5)+
theme(plot.title = element_text(hjust = 0.5))
Here, a plot of the number of shares against the number of keywords is generated. This plot helps assess whether increasing the number of keywords in the metadata is associated with a higher number of shares.
data_plot_2 <- train_df %>%
select(shares, num_keywords,popularity)
ggplot(data_plot_2, aes(num_keywords, shares)) +
geom_jitter(aes(col = shares))+
labs(x="Number of keywords in metadata",y="Number of Shares")
Here, a plot of the average number of shares against the number of words in the title is generated. The title is an important part of an article, and it is key to choose an adequate number of words, and the right words, for it. The plot helps us assess whether the number of words in the title drives the number of shares.
data_plot_3 <- train_df %>%
select(n_tokens_title, shares) %>%
group_by(n_tokens_title) %>%
summarise(mean_shares = mean(shares))
ggplot(data_plot_3, aes(n_tokens_title, mean_shares)) +
geom_line() +
labs(x="Number of words in the title",y="Average number of shares")
These plots attempt to find the trend of the log of shares as a function of average positive and average negative polarity. If the number of shares increases or decreases as the polarity of the content increases, we can infer the trend from the slope of the linear regression line on the plots given below.
ggplot(train_df, aes(avg_positive_polarity,log(shares)))+
geom_point(aes(color=popularity)) +
geom_smooth(method="lm",color='black')+
labs(x="Average positive polarity",y="Log Number of shares")
ggplot(train_df, aes(avg_negative_polarity,log(shares)))+
geom_point(aes(color=popularity)) +
geom_smooth(method="lm",color='black')+
labs(x="Average negative polarity",y="Log Number of shares")
This plot is used to identify the highly correlated predictor variables. A big blue dot signifies a pair of variables with high positive correlation, while a big red dot signifies high negative correlation. This plot helps in removing the highly correlated variables from the model, as including them inflates the variability of the predictions.
all_corr = cor(select_if(train_df, is.numeric), method = c("spearman"))
correlated_variables <- findCorrelation(all_corr,cutoff = 0.8,
                                        verbose=FALSE,names=TRUE,exact=TRUE)
corr_data1 <- train_df %>% select(all_of(correlated_variables))
corr1 = cor(corr_data1,method = c("spearman"))
corrplot(corr1,diag=FALSE)
First, the variables that were added as part of the exploratory data analysis are removed. Following that, the highly correlated variables are removed. These are the variables that are highly correlated in all six subsets (based on the data_channel_is_* values).
# Removing variables added for EDA, plus is_weekend
# (redundant with the weekday_is_* dummies)
train_df2 <- train_df %>% select(-day,-image_grouped,-popularity,-is_weekend)
# Removing highly correlated variables
train_df2 <- train_df2 %>% select(-n_non_stop_words,-kw_min_max,
-self_reference_max_shares,-rate_positive_words,
-n_unique_tokens,-global_rate_negative_words,
-kw_min_min,-kw_avg_min,
-self_reference_avg_sharess)
This plot is used to confirm that no highly correlated variables remain in our dataset before we fit the different models.
all_corr2 = cor(select_if(train_df2, is.numeric), method = c("spearman"))
correlated_variables2 <- findCorrelation(all_corr2,cutoff = 0.8,
                                         verbose=FALSE,names=TRUE,exact=TRUE)
if (length(correlated_variables2) > 1) {
  corr_data2 <- train_df2 %>% select(all_of(correlated_variables2))
  corr2 = cor(corr_data2,method = c("spearman"))
  corrplot(corr2,diag=FALSE)
} else {
  print("Not enough variables above the threshold to generate a correlation plot")
}
The linear regression model takes the form $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$. The $x$'s are the independent variables, called predictors, and $y$ is the dependent variable, called the response. The key assumptions of linear regression are:

1. The variance of the error term $\varepsilon$ is constant and its expected value is 0.
2. The error term $\varepsilon$ is normally distributed.
3. The predictor variables are uncorrelated with each other.

Linear regression is a machine learning model with low flexibility but high interpretability, due to which it is a highly preferred method in regulated institutions such as banks. In order to account for non-linear relationships, squared and cubic terms of the predictors are used, and interaction terms of the form $x_1 x_2$ account for dependencies between predictors in predicting the response (see the sketch below). The key aspect of linear regression is to select the group of features that best describes the response. Below, two methods of feature selection are performed.
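As a small illustration of these terms (a sketch only, not one of the four models fitted below; num_hrefs, num_imgs, and num_videos are columns from train_df2 picked purely for the example), R's formula interface expresses squared and interaction terms directly:

# Sketch: non-linear and interaction terms in an lm() formula.
# I() protects the arithmetic inside the formula; ':' adds an interaction.
lm_sketch <- lm(shares ~ num_hrefs + I(num_hrefs^2) + num_imgs:num_videos,
                data = train_df2)
summary(lm_sketch)$coefficients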
In forward subset selection, we start from a constant (intercept-only) model and at each step add the variable that best improves the description of the response. That is, after the constant is determined, the best single variable from the list of p available variables (say x1) is added. Next, the variable from the remaining p-1 variables that works best alongside x1 (say x2) is added, and so on. The best model among the resulting sequence is selected based on AIC, BIC, and adjusted R-squared values, as explained in best subset selection.
set.seed(111)
tr_ctrl <- trainControl(method = "repeatedcv")
lm_forward_fit <- train(shares ~ ., data = train_df2,
preProcess = c("center", "scale"),
method = "leapForward",
tuneGrid = expand.grid(nvmax = seq(1,42,2)),
trControl = tr_ctrl)
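Assuming the fit above runs, caret stores the cross-validated choice of subset size, which can be inspected as follows (a quick check, not required for the pipeline):

# Best subset size chosen by cross-validation, and the RMSE profile
lm_forward_fit$bestTune
plot(lm_forward_fit)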
Backward stepwise selection is the reverse of forward stepwise selection. In this method, all p variables are included to start. Then, at each step, the variable whose removal least harms the fit to the response is dropped. The best model among all is selected based on AIC, BIC, and adjusted R-squared values.
set.seed(111)
tr_ctrl <- trainControl(method = "repeatedcv",number=3)
lm_backward_fit <- train(shares ~ ., data = train_df2,
preProcess = c("center", "scale"),
method = "leapBackward",
tuneGrid = expand.grid(nvmax = seq(1,42,2)),
trControl = tr_ctrl)
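For these leaps-based fits, the coefficients of the selected subset can be pulled from the underlying regsubsets object (again just an inspection sketch):

# Coefficients of the best subset of the cross-validated size
coef(lm_backward_fit$finalModel, id = lm_backward_fit$bestTune$nvmax)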
Random forests extend the idea of bagging, where we take bootstrapped samples, fit a regression tree to each sample, and then average their predictions. In a random forest, instead of using every predictor in each of the trees, we select a random subset of predictors to make sure the trees are not correlated. train_df2 consists of 42 predictor variables, and using all of them makes the model computationally expensive. Hence we have performed PCA to get linear combinations of those variables that account for most of the variability in the dataset. Looking at the elbow point of the scree plot, we decided to go with the first ten principal components for our analysis.

For the random forest we have used cross-validation, and the number of randomly selected predictor variables for each tree is p/3 (since this is a regression problem).
set.seed(111)
# Get the principal components
PC <- prcomp(select(train_df2,-shares),scale = TRUE)
# Scree plot
screeplot(PC,npcs=length(PC$sdev),type = "lines")
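As a complementary check on the scree plot (a sketch; the choice of ten components still comes from the elbow above), the cumulative proportion of variance captured by the first ten components can be read from the prcomp summary:

# Cumulative proportion of variance explained by the first ten PCs
summary(PC)$importance["Cumulative Proportion", 1:10]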
pca_train_data <- as_tibble(predict(PC,select(train_df2,-shares))) %>%
select(PC1:PC10) %>%
bind_cols(select(train_df2,shares))
rfFit <- train(shares ~ ., data = pca_train_data,
               method = "rf",
               trControl = trainControl(method = "repeatedcv",number = 3),
               # mtry = p/3, with p = 10 principal components
               tuneGrid = data.frame(mtry = floor((ncol(pca_train_data)-1)/3)))
Similar to random forest, boosting is an ensemble machine learning method. In this method, residuals are calculated at each step, and in the next step a tree is fit with those residuals as the response variable. The predicted residuals are multiplied by the learning rate and added to the running prediction, so the model improves slowly, which helps avoid overfitting.
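To make that update concrete, here is a minimal hand-rolled sketch of boosting with depth-one trees (illustrative only; the actual model below is fitted with caret's gbm method, and the learning rate lambda, the 50 iterations, and the use of rpart are assumptions of the sketch):

library(rpart)
lambda <- 0.1                                          # learning rate
f_hat <- rep(mean(train_df2$shares), nrow(train_df2))  # start from the mean
sketch_df <- train_df2
for (b in 1:50) {
  sketch_df$r <- train_df2$shares - f_hat              # current residuals
  stump <- rpart(r ~ . - shares, data = sketch_df, maxdepth = 1)
  f_hat <- f_hat + lambda * predict(stump, sketch_df)  # shrunken update
}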
set.seed(111)
boost_fit <- train(shares ~ .,data = train_df2,
method = 'gbm',
preProcess = c("center", "scale"),
trControl = tr_ctrl,
verbose = FALSE)
For each of the models fitted above on the training data, predictions are made on the test data set. The RMSE on the test data is then calculated and compared across all four models. For comparison, a data frame of each modelling method against its RMSE value is built, and the one with the lowest RMSE value is selected as the winner.
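The metric is RMSE = sqrt(mean((prediction - observed)^2)); the chunks below inline this computation, though it could equally be wrapped in a small helper (a sketch; the name rmse is ours):

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
# e.g. rmse(lm_forward_pred, test_df$shares)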
# Forward Selection
lm_forward_pred <- predict(lm_forward_fit,newdata=select(test_df,-shares))
forward_rmse <- sqrt(mean((lm_forward_pred-test_df$shares)^2))
# Backward Selection
lm_backward_pred <- predict(lm_backward_fit,newdata=select(test_df,-shares))
backward_rmse <- sqrt(mean((lm_backward_pred-test_df$shares)^2))
#Random Forest
pca_test_data <- as_tibble(predict(PC,select(test_df,-shares))) %>%
select(PC1:PC10)
rfPred <- predict(rfFit, newdata = pca_test_data,type = "raw")
rf_rmse <- sqrt(mean((rfPred-test_df$shares)^2))
# Boosted Tree
boosted_pred <- predict(boost_fit,newdata=select(test_df,-shares))
boosted_rmse <- sqrt(mean((boosted_pred-test_df$shares)^2))
model_method <- c("Forward Selection", "Backward Selection", "Random Forest", "Boosted Tree")
model_rmse <- c(forward_rmse, backward_rmse, rf_rmse, boosted_rmse)
model_result <- data.frame(model_method, model_rmse)
model_result
winner <- paste0("The best model is ", model_result[which.min(model_result$model_rmse), 1],
                 " as it has the lowest RMSE value of ", round(min(model_result$model_rmse), 2))
winner
## [1] "The best model is Backward Selection as it has the lowest RMSE value of 5729.99"