---
title: "Modelling of CPU usage through PCA analysis"
author: "Álvaro Francesc Budría Fernández"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
## Introduction
This document presents my analysis of a dataset describing a computer's activity. The final aim is to model the computer's CPU usage with a regression model. There are 8192 observations with 21 features plus a target.
The approach taken here is to obtain the principal components of the data through PCA and then use a few of these components to train a simple model. Several models are compared through a 10-fold cross-validation (CV) procedure.
```{r}
library(TunePareto) # for generateCVRuns()
library(glmnet) # ridge regression
library(RSNNS) # MLP, RBFNN
```
## Read data
```{r}
X <- read.csv('cpu.csv')
# source: https://www.openml.org/d/197
```
## Check for missing values, anomalies, possible errors...
```{r}
sum(is.na(X))  # count of missing values coded as NA
summary(X)
rbind(apply(X, 2, mean), apply(X, 2, sd))  # mean and sd of each column
# Boxplots commented out because the output is huge:
#for (i in 1:22) boxplot(X[,i], main=colnames(X)[i])
```
There are no missing values coded as NA, nor any of the other common encodings for missing data (such as -1 or -99), so we conclude that there are no missing data in the dataset.
The boxplots let us detect many univariate outliers.
We can see quite a lot of outliers for all the variables. If we look further into the observations that show an outlier in a particular variable, we see that they are not necessarily outliers in other variables. Thus, it is not wise to remove every observation with at least one outlying variable, as this would discard too much data. A quick check of this claim is sketched below.
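The following snippet flags univariate outliers with the usual 1.5 × IQR rule and counts, for each observation, how many variables flag it; the cutoff is an assumption made for illustration, not part of the original analysis.
```{r}
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], column by column
flags <- sapply(X[, 1:21], function(v) {
  q <- quantile(v, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  v < q[1] - 1.5*iqr | v > q[2] + 1.5*iqr
})
# How many variables flag each observation as an outlier
table(rowSums(flags))
```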
To get a better grasp of what our data look like, it is useful to project them into a 2D space with multidimensional scaling (MDS).
Since computing the MDS for the whole dataset takes too long on a regular computer, we draw an iid sample containing 20% of the data, which should preserve its statistical properties.
```{r}
set.seed(12345)
# iid subsample with 20% of the rows (named Xsample to avoid shadowing base::sample)
Xsample <- X[sample(nrow(X), size=as.integer(0.2*nrow(X)), replace=FALSE),]
distances <- dist(Xsample, method = "euclidean")
matdist <- as.matrix(distances)
matdist[1:5,1:5]
```
```{r}
hist(matdist[lower.tri(matdist)], main="Pairwise Euclidean distances", xlab="distance")
```
Plot the coordinates in 2D space.
```{r}
mds.out <- cmdscale(matdist, eig = TRUE, k = 2)
coords <- mds.out$points
plot(coords, asp=1, cex=0.5)
```
We can see that the y coordinate behaves very linearly with respect to the x coordinate. Moreover, there seem to be three differentiated groups, one much bigger than the others.
## Split data into training (2/3) and testing (1/3) sets
```{r}
set.seed(12345)
N <- nrow(X)
train <- sample(1:N, round(2*N/3))
ntrain <- length(train)
ntest <- N - ntrain
Xtrain <- X[train,]
Xtest <- X[-train,]
# Fit PCA on the training set only and project the test set with the
# training rotation, so that no information leaks from the test set
# (see the rationale in the next section)
pca_out <- prcomp(Xtrain[,1:21])
pca_train <- pca_out$x[,1:3]
pca_test <- predict(pca_out, newdata=Xtest[,1:21])[,1:3]
df_train <- as.data.frame(cbind(pca_train, Xtrain[,22]))
df_test <- as.data.frame(cbind(pca_test, Xtest[,22]))
```
## Extract principal components
With PCA, we can visualize the data and also extract features that concentrate most of the variability.
It is important to extract the principal components not from the whole dataset but from the training set alone. PCA is computed from the variability of the data, so computing it on all the data would leak information from the testing set into training. This is why the test set above is projected with the rotation fitted on the training set.
```{r}
out <- prcomp(Xtrain[,1:21]) # leave the 22nd feature out, as it is the target
# cumulative proportion of variance explained (variance = sdev^2)
cumsum(out$sdev^2)/sum(out$sdev^2)
plot(out$x[,1], out$x[,2], asp=1, cex=0.5, xlab="PC1", ylab="PC2")
```
PCA concentrates over 99% of the variance in the data in just three principal components.
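Concretely, the proportion of variance retained by the first $m$ components is
$$\frac{\sum_{i=1}^{m}\sigma_i^2}{\sum_{i=1}^{21}\sigma_i^2},$$
where $\sigma_i$ is the standard deviation of the $i$-th principal component (`out$sdev[i]`).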
Moreover, the score plot of the first two components clearly shows two groups that are linearly separable. We might be interested in separating them first and then running two separate regressions, one on each group, as sketched below. But first we should check whether a single regression model is good enough.
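Should a single model prove insufficient, one possible way to implement the two-group idea is to cluster the scores of the first two components and fit one linear model per cluster. The use of k-means with two centers is an assumption made for illustration; this code is not part of the analysis and is not evaluated.
```{r, eval=FALSE}
# Hypothetical two-group regression (illustration only)
grp <- kmeans(pca_train[, 1:2], centers=2)$cluster   # label the two groups
mods <- lapply(split(df_train, grp),                 # fit one lm per group
               function(d) lm(V4 ~ PC1 + PC2 + PC3, data=d))
lapply(mods, function(m) summary(m)$r.squared)       # per-group fit quality
```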
We now proceed to test various regression models:
* Binomial GLM (CPU usage rescaled to $[0,1]$)
* Gaussian GLM (ordinary linear regression)
* Ridge regression
* LASSO regression
* MLP neural network
* RBF neural network
\newpage
## Cross-Validation to Choose the Binomial GLM's Link Function
```{r}
set.seed(12345)
k <- 10
CV.folds <- generateCVRuns(df_train$V4, ntimes=1, nfold=k)
cv.results <- matrix(0, nrow=k, ncol=5)
colnames(cv.results) <- c("k", "fold", "CV error|logit",
                          "CV error|probit", "CV error|cloglog")
cv.results[,"k"] <- k
for (j in 1:k)
{
  # get validation data
  te <- unlist(CV.folds[[1]][[j]])
  # train on the remaining folds; the target is divided by 100 so that it
  # lies in [0,1], as required by the binomial family
  mod_logit <- glm(V4/100 ~ ., family=binomial(link=logit), data=df_train[-te,])
  mod_probit <- glm(V4/100 ~ ., family=binomial(link=probit), data=df_train[-te,])
  mod_cloglog <- glm(V4/100 ~ ., family=binomial(link=cloglog), data=df_train[-te,])
  # predict the validation fold
  pred_logit <- predict(mod_logit, newdata=df_train[te,-4], type="response")
  pred_probit <- predict(mod_probit, newdata=df_train[te,-4], type="response")
  pred_cloglog <- predict(mod_cloglog, newdata=df_train[te,-4], type="response")
  # record the validation error (mean squared error) for this fold
  n <- nrow(df_train[te,])
  cv.results[j,"CV error|logit"] <- sum((df_train[te,]$V4-pred_logit*100)^2) / n
  cv.results[j,"CV error|probit"] <- sum((df_train[te,]$V4-pred_probit*100)^2) / n
  cv.results[j,"CV error|cloglog"] <- sum((df_train[te,]$V4-pred_cloglog*100)^2) / n
  cv.results[j,"fold"] <- j
}
colMeans(cv.results[, 3:5])
```
The logit link outperforms the rest.
Ridge regression: we need to select the regularization parameter $\lambda$ for the model.
```{r}
set.seed(12345)
# reference: https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net
# glmnet standardizes the data internally (standardize = TRUE);
# alpha = 0 selects ridge regression
ridge_cv <- cv.glmnet(as.matrix(df_train[,-4]), df_train[,4], alpha = 0,
                      standardize = TRUE, nfolds = 10)
# best lambda (lambda.min minimizes the CV error; lambda.1se would be a
# more conservative choice)
(lambda_cv <- ridge_cv$lambda.min)
```
LASSO regression: again, we need to select the regularization parameter, called $\mu$ here to distinguish it from the ridge $\lambda$ (glmnet itself calls both `lambda`).
```{r}
set.seed(12345)
# alpha = 1 selects LASSO regression
lasso_cv <- cv.glmnet(as.matrix(df_train[,-4]), df_train[,4], alpha=1,
                      standardize=TRUE, nfolds=10)
# best mu
(mu_cv <- lasso_cv$lambda.min)
```
RBF neural network: we select the number of centroids for the RBFNN. A common heuristic, used here, is roughly the cube root of the number of training observations.
```{r}
M <- floor(nrow(df_train)^(1/3)) # number of RBF centroids (cube-root heuristic)
```
## Cross-Validation of Binomial, Gaussian, Ridge, LASSO, MLPNN and RBFNN Models
```{r}
set.seed(12345)
k <- 10
CV.folds <- generateCVRuns(df_train$V4, ntimes=1, nfold=k)
cv.results <- matrix(0, nrow=k, ncol=8)
colnames(cv.results) <- c("k", "fold", "CV error|Bin",
                          "CV error|Gaussian", "CV error|Ridge",
                          "CV error|LASSO", "CV error|nnet",
                          "CV error|rbf")
cv.results[,"k"] <- k
for (j in 1:k)
{
  # get validation data
  te <- unlist(CV.folds[[1]][[j]])
  # train each model on the remaining folds
  mod_binomial <- glm(V4/100 ~ ., family=binomial(link=logit), data=df_train[-te,])
  mod_Gaussian <- glm(V4/100 ~ ., data=df_train[-te,])
  ridge <- glmnet(as.matrix(df_train[-te,-4]), df_train[-te,4], alpha=0,
                  lambda=lambda_cv, standardize=TRUE)
  lasso <- glmnet(as.matrix(df_train[-te,-4]), df_train[-te,4], alpha=1,
                  lambda=mu_cv, standardize=TRUE)
  my_nnet <- mlp(df_train[-te,-4], df_train[-te,4], size=c(5, 5), maxit=100,
                 hiddenActFunc="Act_Logistic", linOut=TRUE)
  my_rbf <- rbf(df_train[-te,-4], df_train[-te,4], size=c(M), maxit=100,
                initFunc="RBF_Weights", linOut=TRUE)
  # predict the validation fold
  pred_binomial <- predict(mod_binomial, newdata=df_train[te,-4], type="response")
  pred_Gaussian <- predict(mod_Gaussian, newdata=df_train[te,-4])
  y_ridge <- predict(ridge, as.matrix(df_train[te,-4]))
  y_lasso <- predict(lasso, as.matrix(df_train[te,-4]))
  pred_nnet <- predict(my_nnet, df_train[te,-4])
  pred_rbf <- predict(my_rbf, df_train[te,-4])
  # record the validation error (mean squared error) for this fold
  n <- nrow(df_train[te,])
  cv.results[j,"CV error|Bin"] <- sum((df_train[te,4]-pred_binomial*100)^2) / n
  cv.results[j,"CV error|Gaussian"] <- sum((df_train[te,4]-pred_Gaussian*100)^2) / n
  cv.results[j,"CV error|Ridge"] <- sum((df_train[te,4] - y_ridge)^2) / n
  cv.results[j,"CV error|LASSO"] <- sum((df_train[te,4] - y_lasso)^2) / n
  cv.results[j,"CV error|nnet"] <- sum((df_train[te,4] - pred_nnet)^2) / n
  cv.results[j,"CV error|rbf"] <- sum((df_train[te,4] - pred_rbf)^2) / n
  cv.results[j,"fold"] <- j
}
colMeans(cv.results[, 3:8])
```
This time the linear models outperform the non-linear ones! This was expected, as the PCA score plot clearly showed a markedly linear trend. The best model is the binomial GLM with logit link.
## Compute Testing Error and Confidence Interval
### Compute Testing Error
```{r}
final_mod <- glm(V4/100 ~ ., family=binomial(link=logit), data=df_train)
# evaluate the final model on the held-out test set
pred <- predict(final_mod, newdata=df_test[,-4], type="response")
Mp <- sum((df_test[,4] - pred*100)^2)
N <- nrow(df_test)
(NRMSE <- sqrt(Mp / ((N-1)*var(df_test[,4]))))
(R2 <- 1 - NRMSE^2) # coefficient of determination R^2
```
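In symbols, with $y_i$ the observed CPU usage, $\hat{y}_i$ the model prediction and $\bar{y}$ the test-set mean, the quantities computed above are
$$\mathrm{NRMSE} = \sqrt{\frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{(N-1)\operatorname{Var}(y)}} = \sqrt{\frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}}, \qquad R^2 = 1 - \mathrm{NRMSE}^2.$$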
## Confidence Interval for the Determination Coefficient $R^2$
Source: https://stats.stackexchange.com/questions/175026/formula-for-95-confidence-interval-for-r2
Significance level: 5%
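The approximate standard error used below, with $n$ observations and $k$ model coefficients, is
$$\mathrm{SE}(R^2) = \sqrt{\frac{4R^2(1-R^2)^2(n-k-1)^2}{(n^2-1)(n+3)}},$$
and the interval is taken as $R^2 \pm 2\,\mathrm{SE}(R^2)$.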
```{r}
n_coeffs <- 4 # number of model coefficients (three PCs plus the intercept)
SE <- sqrt( (4*R2*(1-R2)^2*(N-n_coeffs-1)^2) / ((N^2-1)*(N+3)) )
(int_conf <- c(R2 - 2*SE, R2 + 2*SE))
```
This is a very tight interval, suggesting that the model should generalize well.