Usecase clustering #90
base: gh-pages
src/usecase_clustering.Rmd
Outdated
An overview over all learners can be found [here](http://mlr-org.github.io/mlr-tutorial/devel/html/integrated_learners/index.html). You can also call the \texttt{listLearners} command for our specific task.

```{r, warning=FALSE, eval = FALSE}
You can leave `warning=FALSE` out, because Travis has all packages and no warning will be produced.
Hi, looks very good already.
I just skimmed over it very quickly and commented on some technical stuff.
src/usecase_clustering.Rmd
Outdated
set.seed(1234)
```

This is a use case for clustering with the [%mlr] package. We consider the [agriculture](https://www.rdocumentation.org/packages/cluster/versions/1.10.0/topics/agriculture) dataset that contains observations about $n=12$ countries including
Please use `[agriculture](&cluster::agriculture)`. (The build script for the tutorial will expand this to the correct link.)
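As a quick sanity check on the dataset description in this hunk, the following sketch loads the data and inspects its shape. It assumes the cluster package is installed (it is one of R's recommended packages); apart from the dataset name, nothing here is taken from the PR.

```r
# Assumes the cluster package is available; it ships with standard R installations.
data(agriculture, package = "cluster")

# Number of observations (countries) and number of variables.
dim(agriculture)

# Column names and types; the plot call later in the PR uses the columns x and y.
str(agriculture)
```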
src/usecase_clustering.Rmd
Outdated
```{r, fig.width = 5}
library("cluster")
data(agriculture)
Please use `data(agriculture, package = "cluster")`.
src/usecase_clustering.Rmd
Outdated
So let's have a look at the data first.

```{r, fig.width = 5}
Please specify the aspect ratio (`fig.asp`) instead of the `fig.width`. (This works better for the pdf version of the tutorial.)
src/usecase_clustering.Rmd
Outdated
* define the learning task ([here](http://mlr-org.github.io/mlr-tutorial/devel/html/task/index.html)),
* select a learning method ([here](http://mlr-org.github.io/mlr-tutorial/devel/html/learner/index.html)),
* train the learner with data ([here](http://mlr-org.github.io/mlr-tutorial/devel/html/train/index.html)),
I think "train the learner" is sufficient.
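For readers of this thread, the three steps listed in the hunk above could look roughly like the sketch below. It assumes the mlr and clue packages are installed (mlr's `cluster.kmeans` learner relies on clue) and simply skips the example otherwise. The object names follow those used later in the PR, while `centers = 3` and the task id are arbitrary choices made for illustration.

```r
# Sketch only: runs if mlr and clue are installed, otherwise does nothing.
if (requireNamespace("mlr", quietly = TRUE) &&
    requireNamespace("clue", quietly = TRUE)) {
  library(mlr)
  data(agriculture, package = "cluster")

  # 1. Define the learning task; a clustering task has no target variable.
  agri.task = makeClusterTask(id = "agriculture", data = agriculture)

  # 2. Select a learning method; centers = 3 is an arbitrary illustrative value.
  cluster.lrn = makeLearner("cluster.kmeans", centers = 3)

  # 3. Train the learner.
  set.seed(1234)
  agri.mod = train(cluster.lrn, agri.task)
  print(agri.mod)
}
```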
src/usecase_clustering.Rmd
Outdated
We now have to define a clustering task. Notice that a clustering task doesn't have a target variable.

```{r message = FALSE}
library(mlr)
You don't need `library(mlr)` and then can also leave out the `message = FALSE` option.
src/usecase_clustering.Rmd
Outdated
Tuning will address the question of choosing the best hyperparameters for our problem.

We first create a search space for the number of clusters $k$, e. g. $k \in \lbrace 2, 3, 4, 5 \rbrace$. Further we define an optimization algorithm and a [resampling strategy](http://mlr-org.github.io/mlr-tutorial/devel/html/resample/index.html).
As above you need to link to `resample.md`.
src/usecase_clustering.Rmd
Outdated
We first create a search space for the number of clusters $k$, e. g. $k \in \lbrace 2, 3, 4, 5 \rbrace$. Further we define an optimization algorithm and a [resampling strategy](http://mlr-org.github.io/mlr-tutorial/devel/html/resample/index.html).

Finally, by combining all the previous pieces, we can tune the parameter $k$ by calling \texttt{tuneParams}. We will use discrete_ps with grid search and the silhouette coefficient as optimization criterion:
`[&tuneParams]` - I would also mention 3-fold cross-validation.
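To make the tuning idea concrete for readers of this thread, here is a rough base R sketch of picking $k$ from $\lbrace 2, 3, 4, 5 \rbrace$ by the silhouette coefficient. It is not the `tuneParams` call from the PR: it swaps in `stats::kmeans` and `cluster::silhouette` directly and skips the 3-fold cross-validation. The names `ks`, `avg.sil` and `best.k`, as well as `nstart = 10`, are assumptions made for illustration.

```r
# Plain grid search over k without resampling; assumes the cluster package is installed.
data(agriculture, package = "cluster")
set.seed(1234)

ks = 2:5                # search space for the number of clusters
d = dist(agriculture)   # Euclidean distances between the countries

# Average silhouette width of a k-means clustering for each candidate k.
avg.sil = sapply(ks, function(k) {
  km = kmeans(agriculture, centers = k, nstart = 10)
  mean(cluster::silhouette(km$cluster, d)[, "sil_width"])
})
names(avg.sil) = ks

# Larger silhouette values indicate better separated clusters.
best.k = ks[which.max(avg.sil)]
print(avg.sil)
print(best.k)
```

With 3-fold cross-validation, as in the PR, the criterion would instead be computed on held-out folds and averaged, so the selected $k$ may differ from this in-sample version.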
src/usecase_clustering.Rmd
Outdated
discrete_ps = makeParamSet(makeDiscreteParam("centers", values = c(2, 3, 4, 5)))
ctrl = makeTuneControlGrid()
res = tuneParams(cluster.lrn, agri.task, measures = silhouette, resampling = cv3,
par.set = discrete_ps, control = ctrl)
Could you please indent code by 2 spaces?
src/usecase_clustering.Rmd
Outdated
This is our final clustering for our problem.

```{r, fig.width= 5}
Please use `fig.asp`.
src/usecase_clustering.Rmd
Outdated
This is our final clustering for our problem.

```{r, fig.width= 5}
plot(y ~ x, col = tuned.pred$data$response, data = agriculture)
Could you please use the getter function (I think `getPredictionResponse` should work here)?
head(data)
```

Our aim - as mentioned before - is to predict which kind of people would have survided.
`survided` typo
#### Preprocessing

The data set is corrected regarding their data types.
I would do `str(data)` to show the different types, then mention how and why they need to be corrected.
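A minimal sketch of what this suggestion could look like is below. The data frame is a hypothetical stand-in, since the actual data set is not shown in this hunk; its column names and values are invented purely for illustration.

```r
# Hypothetical stand-in for the data set under review (invented columns and values).
data = data.frame(
  Survived = c(0, 1, 1),
  Sex = c("male", "female", "female"),
  Age = c(22, 38, 26),
  stringsAsFactors = FALSE
)

# Show the types as read in: Survived is numeric and Sex is character.
str(data)

# Categorical variables are converted to factors so that they are treated
# as categories rather than as numbers or free text.
data$Survived = factor(data$Survived)
data$Sex = factor(data$Sex)

# Show the corrected types.
str(data)
```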
@juliambr Do you still have motivation to finish this up here? Would be great! 🎉
wrote a usecase for clustering - please review it :)